Empirical Baseline

In the following discussion, we assume that topic collections are the main objective, whether as objects in themselves (e.g. for continued reference in a library), as a basis for questions not yet defined, as audit trails (e.g. due diligence), as a means of understanding patterns of knowledge (e.g. the influence of a specific research program), or as a way to observe characteristics of web materials in order to develop better tools, techniques, and training. That is, we contrast our scope with that of the ad hoc query posted to a search engine, with that of the web page author, and with that of the general Internet technology developer.

The following data are not statistically valid, nor can they be validated, since the notion of a subject's exact content accessible on the web at a given point in time is not precisely definable. Rather, we attempt to define guidelines and illustrate them with experimental collections.

Quantity - How much is on the web about topic X?

Given topics vaguely characterized by phrases like "Therac-25" (a software safety case study), "FMEA" (Failure Mode and Effects Analysis, an engineering technique), "design patterns" (the common procedure representation), "security policy", "access control list", "digital library" (the subject of our experiment), "information foraging" (a major related research area), or XX (a well-known computer scientist), we often want to know:

Our composite answer is:

Now consider the requirements for tools to permit a web professional to process a subject on the order of hours given this type of characterization:

  1. Metasearchers to probe multiple engines and collate results. Unfortunately, one query does not fit all engines, so metasearchers introduce additional bogus hits beyond what the search engines normally deliver. Alternatively, the user can query several engines directly, saving the multiple pages delivered (from 10 URLs per page by AltaVista to 100 by HotBot) up to the search engine limits (e.g. 500 for Infoseek), but this is extremely time-consuming and prone to loss (saved pages must be named, or found again in the browser's cache). Either way, multiple search engines must be used to collect pages on a subject adequately.
  2. Bandwidth, disk storage and compression to manipulate files. Meeting these requirements is not difficult with current technology, but a new process (workflow) is nevertheless introduced into the web IP's toolset -- keeping track of large, time-intensive collections of files, changes to them, and directories.
  3. Tools for scanning such large amounts of materials, both in abbreviated form and as full documents, to determine what is on target and to ascertain quality.
  4. User skills for making sufficiently rapid decisions as to relevance and quality, and a means of recording and executing those decisions.
  5. Ways of reporting results to users next in line, e.g. library patrons, due diligence auditors.
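
The collation step in requirement 1 can be sketched as follows. This is a minimal illustration only, not the behavior of any particular metasearcher: the function name collate, the engine labels, and the example URLs are all hypothetical, and the result lists are assumed to have been saved already as plain lists of URL strings.

```python
def collate(results_by_engine):
    """Merge per-engine URL lists into one mapping of URL -> engines that returned it.

    The first-seen order of URLs is preserved, and each engine is recorded
    at most once per URL even if it returned that URL on several pages.
    """
    merged = {}
    for engine, urls in results_by_engine.items():
        for url in urls:
            engines = merged.setdefault(url, [])
            if engine not in engines:
                engines.append(engine)
    return merged


# Hypothetical saved results from two engines (labels and URLs are made up).
results = {
    "engine_a": ["http://example.org/fmea.html", "http://example.org/safety.html"],
    "engine_b": ["http://example.org/safety.html", "http://example.com/therac.html"],
}

collated = collate(results)
for url, engines in collated.items():
    print(url, engines)
```

Recording which engines returned each URL, rather than only the merged list, keeps some evidence for the later relevance decisions: a URL reported by several engines may deserve an earlier look, although that is a heuristic and not a finding of this study.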

Appendix A contains an earlier chart of data from studies conducted prior to March 1997. Studies performed since then indicate several changes:

  1. A more powerful, if sometimes erratic, metasearcher, WebFerret Pro, provides more URLs than we had been getting with other metasearchers or by directly querying and clicking down the pages. It's likely there are twice as many URLs as quoted in these tables, although many of the additional ones are accounted for by duplication (mirror sites on the web) and variations of URLs.
  2. Several studies have shown that key people are good indicators of the size and scope of a subject. Web-active individuals will typically refer to all the conferences they attend, the papers they present, their organizational and work affiliations, and other researchers in the field. Here are approximate numbers of URLs on several individuals:
  3. From our survey, "security policy" + "access control lists" + "digital libraries" yields roughly 1500 URLs, probably upward of 2000 for security-related pages.
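
Because raw URL counts are inflated by variations of the same URL, a count of distinct pages needs some normalization first. The sketch below is one hedged way to do that with Python's standard urllib.parse module; the specific rules (lowercasing the host, dropping the default HTTP port and the fragment, and treating a trailing index.html as the directory) are our assumptions for illustration, and the example URLs are made up. It does not detect mirror sites, which live on different hosts and would need content comparison.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Reduce common URL variations to one canonical form (illustrative rules only)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the default port for plain HTTP.
    if port is None or (scheme == "http" and port == 80):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # Assumption: ".../index.html" names the same page as ".../".
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]
    # The fragment is dropped because it points inside the same page.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


# Hypothetical URL variants as they might appear in collated search results.
urls = [
    "HTTP://WWW.Example.org:80/docs/index.html#top",
    "http://www.example.org/docs/",
    "http://www.example.org/docs/other.html",
]

unique_urls = {normalize(u) for u in urls}
print(len(urls), "raw URLs,", len(unique_urls), "after normalization")
```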

Why is this information significant? If the rule "500-1500 per specific topic" holds, then we know

For the purposes of experimenting with "browsing in context", this tells us that, generally, we've validated that we're in a worthwhile range of exploration:

Quality - What should go into a good topic collection?

While quantitative data is relatively easy to come by, how can we establish a baseline for quality? So far, we've performed two exercises in comparative analysis of collections.

For FMEA, we worked with a graduate computer science student on her independent study project on software safety using Nancy Leveson's Safeware book. She applied her background in civil engineering to identify key high-content resources on FMEA. Working with only the reports, she expressed considerable frustration at getting to most sites on the web (even at 2 a.m.) but worked better with the downloaded files. She assessed about 10 pages as being highly valuable and interesting, including some Stanford technical reports, European Esprit project descriptions, and a few product-related discussions.

In another, earlier project, we had collected and then assessed several hundred pages on COCOMO II, a software cost model, for a colleague selling a COCOMO implementation. My assessment of the value of each link as a candidate for his website, as well as my identification of competitors, agreed with his on about 85% of the URLs. That we could provide this type of assessment at all required a process and tools for capturing rating information.
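
An agreement figure of this kind can be computed once both sets of ratings are captured per URL. The following is a minimal sketch of simple percent agreement, assuming each rater's decisions are stored as a mapping from URL to a rating label; the function name, the labels, and the sample data are hypothetical and are not the ratings from the COCOMO project.

```python
def percent_agreement(ratings_a, ratings_b):
    """Return the percentage of URLs rated by both raters on which the labels match."""
    shared = ratings_a.keys() & ratings_b.keys()
    if not shared:
        raise ValueError("the two raters have no URLs in common")
    matches = sum(1 for url in shared if ratings_a[url] == ratings_b[url])
    return 100.0 * matches / len(shared)


# Made-up ratings for four URLs; only the last one differs.
rater_1 = {
    "http://example.org/a.html": "link",
    "http://example.org/b.html": "skip",
    "http://example.org/c.html": "competitor",
    "http://example.org/d.html": "link",
}
rater_2 = {
    "http://example.org/a.html": "link",
    "http://example.org/b.html": "skip",
    "http://example.org/c.html": "competitor",
    "http://example.org/d.html": "skip",
}

print(percent_agreement(rater_1, rater_2))  # 75.0
```

Simple percent agreement does not correct for agreement that would occur by chance, so it should be read as a rough indicator only.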

Information quality is a major discussion topic for many newsgroups, ranging from website design to quality of information. The best central sources we've found are two mailing lists: UCSD Professor Phil Agre's Red Rock Eater List and the INFORMATION QUALITY WWW Virtual Library.