Empirical Baseline
In the following discussion, we assume that topic collections are the
main objective, either as objects in themselves (e.g. for continued reference
in a library), as a basis for undefined questions, as audit trails (e.g.
due diligence), to understand patterns of knowledge (e.g. influences of
a specific research program), or to observe characteristics of web materials
to develop better tools, techniques, and training. That is, we contrast
our scope with that of the ad hoc query posted to a search engine, with
the web page author, and with the general Internet technology developer.
The following data is not statistically valid, nor is it possible to
validate the data, since a subject's exact content accessible
on the web at a given point in time is not precisely definable. Rather,
we attempt to define guidelines and illustrate them with experimental collections.
Quantity - How much is on the web about topic X?
Given topics loosely characterized by phrases like "Therac-25" (a software
safety case study), "FMEA" (Failure Mode and Effects Analysis,
an engineering technique), "design patterns" (the common procedure
representation), "security policy", "access control list",
"digital library" (the subject of our experiment), "information
foraging" (a major related research area), or XX (well-known computer
scientist), we often want to know:
- How many URLs are known on this subject to search engines?
- How many web pages at those URLs are actually available?
- How much content (MB, words, concepts) is there?
- How many search engine results are actually on target?
- Are some URLs referred to more often than others? Which sites contribute
the most URLs?
- How many URLs are worth keeping on a list to be used for serious research?
Our composite answer is:
- Roughly 1000-3000 URLs will be delivered from queries to search engines,
with very little overlap among results (because search engines index different
parts of the web, different parts of sites, different parts of pages, and
they visit web pages at different intervals). Further inspection will likely
show that one or two search engines will have indexed the subject particularly
well or have retrieval and relevance mechanisms well suited to the query.
- 10-20% of the URLs will be dead or changed. Newsletters and job postings
change frequently by nature, and companies and organizations commonly shift
servers; infrequent search engine crawling compounds both effects.
- 30-50MB of web content is common, with a few large (1MB) files (often
from government agencies, dumped databases, or online thesauri). A commercial
indexing package we use, Iconovex SWAPI, often displays 30,000 meaningful
phrases (person and place names and "concepts").
- Most URL lists, after removing dead links, will shrink by half when
common professional criteria are applied to define subject relevance and
acceptability. Such criteria include redundancy, authenticity, whether the
content is worth looking at again, utility of the material, intelligibility,
and added value to the collection.
- Most URLs are linked to by at most one other page, usually from their
own site. Clearinghouse and general subject classifications are often widely
linked, while major textbooks and influential individuals are often referred
to but without linking. In other words, the web isn't much of a real web,
but rather a cluster of sites linked to primarily by and through search
engines.
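The composite figures above can be combined into a rough estimate of how many URLs survive to a working list. The sketch below is illustrative arithmetic only: the function name `estimate_keepers` and its parameters are hypothetical, and it assumes the dead-link and relevance reductions apply one after the other and independently of each other.

```python
def estimate_keepers(urls_found, dead_fraction, relevance_fraction=0.5):
    """Estimate how many URLs survive link checking and relevance review.

    urls_found         -- URLs delivered by the search engines
    dead_fraction      -- share of URLs that are dead or changed (0.10 to 0.20 above)
    relevance_fraction -- share of live URLs kept after review (about 1/2 above)
    """
    live = urls_found * (1 - dead_fraction)
    return round(live * relevance_fraction)


# Bracket the estimate with the least and most favorable figures quoted above.
low = estimate_keepers(1000, dead_fraction=0.20)
high = estimate_keepers(3000, dead_fraction=0.10)
print(low, high)  # prints: 400 1350
```

Under these assumptions a topic yields somewhere between roughly 400 and 1350 URLs worth keeping.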
Now consider the requirements for tools to permit a web professional
to process a subject on the order of hours given this type of characterization:
- Metasearchers to probe multiple engines and collate results. Note that
it's not, unfortunately, "one query fits all" so metasearchers
will introduce additional bogus hits beyond what search engines normally
deliver. Alternatively, the user can query several engines directly, saving
the multiple pages delivered (10 URLs per page by AltaVista through 100
by HotBot) up to search engine limits (e.g. 500 for Infoseek), but this
is extremely time-consuming and subject to loss (naming pages or finding
them in the browser's cache). But to adequately collect pages on a subject,
multiple search engines must be used.
- Bandwidth, disk storage and compression to manipulate files. Meeting
these limits is not difficult with current technology, but nevertheless
a new process (workflow) is introduced into the web IP's toolset -- keeping
track of large, time-intensive collections of files, changes to them, and
directories.
- Tools for scanning such large amounts of materials, both in abbreviated
form and as full documents, to determine what is on target and to ascertain
quality.
- User skills for making sufficiently rapid decisions as to relevance
and quality, and means of recording and executing those decisions.
- Ways of reporting results to users next in line, e.g. library patrons,
due diligence auditors.
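The collation step required of a metasearcher, and the low overlap among engines noted earlier, can be sketched as follows. This is a minimal illustration, not a description of any actual metasearcher: the function names are hypothetical, the result lists are made-up placeholders rather than real search output, and overlap is measured here with a simple shared/combined ratio that we chose for the example.

```python
from itertools import combinations


def collate(results_by_engine):
    """Merge per-engine URL lists into a map: URL -> set of engines that returned it."""
    seen = {}
    for engine, urls in results_by_engine.items():
        for url in urls:
            seen.setdefault(url, set()).add(engine)
    return seen


def pairwise_overlap(results_by_engine):
    """For each pair of engines, return shared URLs divided by combined URLs."""
    overlap = {}
    for a, b in combinations(sorted(results_by_engine), 2):
        set_a, set_b = set(results_by_engine[a]), set(results_by_engine[b])
        union = set_a | set_b
        overlap[(a, b)] = len(set_a & set_b) / len(union) if union else 0.0
    return overlap


# Placeholder data only; these are not real query results.
results = {
    "AltaVista": ["http://a.example/fmea.html", "http://b.example/safety.html"],
    "HotBot": ["http://b.example/safety.html", "http://c.example/faq.html"],
    "Infoseek": ["http://d.example/report.html"],
}
merged = collate(results)
print(len(merged), "distinct URLs")
for pair, score in pairwise_overlap(results).items():
    print(pair, round(score, 2))
```

Recording which engines returned each URL, rather than only the merged list, makes it possible to see afterwards which one or two engines covered the subject best.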
Appendix A contains an earlier
chart of data from studies conducted prior to March 1997. Studies performed
since then indicate several changes:
- A more powerful, if sometimes erratic, metasearcher, WebFerret Pro,
provides more URLs than we had been getting with other metasearchers or
directly querying and clicking down the pages. It's likely there are twice
as many URLs as quoted in these tables, although many of the additional
ones are accounted for by duplication (mirror sites on the web) and variations
of URLs.
- Several studies have shown that key people are good indicators of the
size and scope of a subject. Web-active individuals will typically refer
to all the conferences they attend, the papers they present, their organizational
and work affiliations, and other researchers in the field. Here are approximate
numbers of URLs on several individuals:
- David Eichmann, university professor, NASA technology transfer
project leader, and web researcher: 1000+ URLs, half project reports, half
references and external papers
- Greg Notess, reference librarian, writer and columnist for Online magazines:
500+ URLs, mostly self-posted, but many references to tutorial materials
from other libraries
- Anita Jones, former DoD Director of Research and Engineering:
500 URLs, mostly technology policy and public statements
- Harlan Mills, industrial research leader, founder of new approaches to
software engineering: 400 URLs, on case studies and technology transfer
of his approach
- From our survey, "security policy" + "access control
lists" + "digital libraries", roughly 1500 URLs, probably
upward of 2000 for security-related pages.
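Since many of the additional URLs are duplicates or spelling variations, raw counts are best compared after some normalization. The sketch below shows one possible approach using Python's standard `urllib.parse`; the specific rules (lowercasing the host, dropping the default port and fragment, folding `index.html` into its directory) are assumptions chosen for illustration, and the function names are hypothetical. It does not detect mirror sites, which would require comparing page content.

```python
from collections import Counter
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Reduce common spelling variations of a URL to one canonical form."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    # Keep an explicit port only when it is not the HTTP default.
    if parts.port and parts.port != 80:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Assumption: a directory URL and its index.html are the same resource.
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]
    # Drop the fragment; it names a position within a page, not another page.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


def dedupe(urls):
    """Return unique normalized URLs, preserving first-seen order."""
    return list(dict.fromkeys(normalize(u) for u in urls))


def urls_per_host(urls):
    """Count how many distinct URLs each host contributes."""
    return Counter(urlsplit(u).hostname for u in dedupe(urls))


# Placeholder URLs for illustration only.
sample = [
    "HTTP://WWW.Example.org:80/fmea/index.html",
    "http://www.example.org/fmea/",
    "http://www.example.org/fmea/#top",
    "http://mirror.example.net/fmea/",
]
print(dedupe(sample))
print(urls_per_host(sample).most_common())
```

The per-host count also gives a first answer to the earlier question of which sites contribute the most URLs.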
Why is this information significant? If the rule "500-1500 URLs per
specific topic" holds, then we know
- roughly how much data our tools must handle, a key issue for visualization
tools
- that web searchers must not stop with a single search engine if they expect
thorough coverage of a subject (although often their objective is a specific
reference or answer)
- that some growth will occur, but that for many topics much of
the material is likely already online
For the purposes of experimenting with "browsing in context",
this tells us that we are, generally, in a worthwhile
range of exploration:
- there is sufficient material on most subjects to warrant multiple views.
- there is not so much material available that we cannot begin experiments
with our existing tool infrastructure although we might expect to stress
these or outgrow them soon.
- multiple contexts will exist and be interesting.
- the amount is beyond what normal users would attempt by hand using
existing browsers, word processing, and other tools
Quality - What should go into a good topic collection?
While quantitative data is relatively easy to come by, how can we establish
a baseline for quality? So far, we've performed two exercises in comparative
analysis of collections.
For FMEA, we worked with a graduate computer science student on her
independent study project on software safety using Nancy Leveson's Safeware
book. She applied her background in civil engineering to identify key high-content
resources on FMEA. Working with only the reports, she expressed considerable
frustration at getting to most sites on the web (even at 2 a.m.) but worked
better with the downloaded files. She assessed about 10 pages as being
highly valuable and interesting, including some Stanford technical reports,
European Esprit project descriptions, and a few product-related discussions.
In another earlier project, we had collected and then assessed several
hundred pages on COCOMO II, a software cost model, for a colleague
selling a COCOMO implementation. My assessment of the value of a link to
put on his website, as well as identification of competitors, agreed with
his on about 85% of the URLs. That we could provide this type of assessment
at all required a process and tools for capturing rating information.
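An agreement figure like the one above is straightforward to compute once each reviewer's ratings are captured per URL. The following is a minimal sketch under our own assumptions: the rating labels ("keep", "drop", "competitor"), the placeholder URLs, and the function name are hypothetical and do not describe the actual tools used in the study.

```python
def percent_agreement(ratings_a, ratings_b):
    """Share of URLs rated by both reviewers on which their ratings match."""
    shared = ratings_a.keys() & ratings_b.keys()
    if not shared:
        raise ValueError("no URLs were rated by both reviewers")
    matches = sum(ratings_a[url] == ratings_b[url] for url in shared)
    return matches / len(shared)


# Placeholder ratings for illustration only.
reviewer = {
    "http://a.example/model.html": "keep",
    "http://b.example/tool.html": "competitor",
    "http://c.example/notes.html": "drop",
    "http://d.example/paper.html": "keep",
}
colleague = {
    "http://a.example/model.html": "keep",
    "http://b.example/tool.html": "competitor",
    "http://c.example/notes.html": "keep",
    "http://d.example/paper.html": "keep",
}
print(f"{percent_agreement(reviewer, colleague):.0%}")  # prints: 75%
```

Simple percent agreement does not correct for matches that would occur by chance, so it is best read as a rough indicator.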
Information quality is a major discussion topic for many newsgroups,
ranging from website design to quality of information. The best central
sources we've found are two mailing lists: UCSD
Professor Phil Agre's Red Rock Eater List and the
INFORMATION QUALITY WWW Virtual Library.