Susan L. Gerhart, ROI Joint Venture slger@netropolis.net
Position Paper for CHI97 Workshop on Augmented Conceptual Analysis of the Web, March 1997, Atlanta GA.
Let's say I'm a "web information professional" who regularly needs to collect extensive amounts of technical material on particular web topics for specific client needs, catalogs "on spec", or personal growth. Search engines and pathfinder pages will typically bring in several hundred URLs -- an eclectic mess of diverse content, relative and absolute quality, and accessibility. What techniques do I employ? What tools do I assemble? What conceptual framework do I use to trade off quantity and quality for specific projects and for evolution of my intellectual property base? Also, I need to turn my search results and polished reports around quickly, both to meet deadlines and to be cost-effective.
Following are several gross empirical claims based on experience with topics such as those in the attached table of data:
Several months of web foraging and increasingly disciplined topic organization have revealed the pros and cons of different ways of working the web. On-line surfing is needed and useful only to access initial topic pointers (typically search engines, clearinghouses, and personal knowledge) and then later to follow up leads. One-URL-at-a-time click-and-wait is not only mind-numbing but also a temptation and distraction from the goal of the topic collection. Downloaders (Windows 95 products such as Surfbot, NetAttache Pro, WebWhacker/WebSeeker, EchoSearch, and our twURL) can collect the initial content for offline browsing (depending, of course, on open representation of the downloaded "data bases").
But that's only the start. Recording impressions, classifying URLs, and generating reports are the new tasks that absorb time. List editors, bookmark managers, and other browser accessories are required to manage these tasks. Just as the "office suites" of 5 years ago formed from word processing, spell checking, drawing tools, and macro languages, new suites are forming to master these time-consuming tasks. Our experience building twURL is that (1) providing a wide array of recording mechanisms can significantly increase the user's ability to concentrate on reading and assessing content and (2) an embedded, programmable browser, even one moderately impaired with respect to modern multimedia and HTML, provides a significant jump in capability. A recent PC magazine "Abort-Retry-Fail" cartoon shows Ross Perot on Larry King Live saying "enough about the election, let's talk browsers. The two big ones aren't reflecting the needs of users. It's time for a 'reform browser'." Well, maybe the cartoon Perot was right about one thing - the browser warriors might be missing out on the essence of the problem from a content-analyzer's perspective.
Assuming the mechanical and logistic problems were (being) solved, how would this topic collector and organizer and analyst get down to the heart of the problem? Our experience is that a good set of user-defined multi-dimensional classification schemes tailored to the specific analysis goal and topic can provide the focus and decision-making. For example, most URLs, i.e. their pages, will fit into the following "genre" classifications:
Now, assuming that (1) downloading, browsing, recording, and reporting are small constant parts of the processing for each URL and that (2) classification schemes such as the above provide frameworks for quality discrimination, as well as useful tailoring for end users of the topic collections, what is the most significant part of the decision-making operation and determiner of how long it takes? Consider the task of making a one-minute assessment of a page sitting in a browser. The decision may be a first cut requiring little reading (e.g. a directory listing or an "under construction") or it may require some reading, scrolling, and thinking. If there is no sequence to the URL progression (perhaps 500 or so to be processed), then the overhead of changing contexts is enormous for most of the diverse web collections - one minute a grad student bookmark list, the next a major research luminary's technical report, then a tools vendor, a conference, etc. The more ways the material can be clustered to provide continuity to the decision-maker's thinking, i.e. knowing that this URL fits into some previously established region of the web, if only .com vs. .edu vs. .mil or the assurance of multiple search engine hits, the more context-switching is reduced. Not only can the analyst remain sane while working at stretches of an hour or so, but the resulting classification may be more consistent and reproducible.
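One way to picture the clustering described above is a review queue sorted so that neighboring URLs share a domain class and a site, with multiply-hit URLs first within each site. The following is a minimal sketch under stated assumptions, not how twURL works: the function names, the sort order, and the example.com-style URLs are all hypothetical.

```python
from urllib.parse import urlsplit

# Domain classes named in the text; everything else is lumped into "other".
DOMAIN_CLASSES = ("com", "edu", "gov", "org", "net", "mil")


def domain_class(url):
    """Return the top-level domain class of a URL, or "other"."""
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return tld if tld in DOMAIN_CLASSES else "other"


def review_order(hits):
    """Order URLs so that consecutive items share as much context as possible.

    hits maps each URL to the set of search engines that returned it
    (an assumed input shape). Sort key: domain class, then host, then
    URLs found by more engines first, then the URL itself.
    """
    def key(url):
        host = urlsplit(url).hostname or ""
        return (domain_class(url), host, -len(hits[url]), url)

    return sorted(hits, key=key)


if __name__ == "__main__":
    # Hypothetical hit list; engine names are placeholders.
    hits = {
        "http://www.example.com/tools": {"engineA"},
        "http://cs.example.edu/~grad/bookmarks.html": {"engineA", "engineB"},
        "http://www.example.com/about": {"engineA", "engineB"},
        "http://cs.example.edu/tr/report.html": {"engineA"},
    }
    for url in review_order(hits):
        print(domain_class(url), len(hits[url]), url)
```

Any other clustering dimension (genre, source page, rating so far) could be added to the sort key in the same way; the point is only that the analyst sees runs of related pages rather than a random sequence.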
Thus, our take on one of the major needs of web users is for better "browsing in context", not just to perform the above challenging analysis tasks but also for regular browsing. Somewhere along the line, browsers (the big ones needing 'reform') turned into motion picture theaters rather than hypertext navigators. Nothing wrong with the former, but many users (a few million or so) most likely need better tools for exploiting the WWW as a vast technical library and personal growth environment.
Several technology and infrastructure improvements are needed:
Following is a table of results from several topic collections using our twURL tool - a kind of Swiss Army knife with downloader, outline organizer, editor, report generator, browser, rating recorder, etc. The data shows some of the scale of (still experimental) topic collections, e.g. a filtered 250 up to unfiltered 1250.
The first column describes the topic, how it was collected, and degrees of processing (ongoing). The second column shows the associated domain classification, where "Other" is all international, numeric, and often erroneous URLs. Numbers of the form {X Y} indicate X URLs in that class with Y links from the source of the URLs (either saved or cached search engine pages, downloaded web pages, etc.). Site sizes are shown as X [Y] where Y is the number of sites with X URLs. Note that these collections typically start out with loads of "search engine junk", including ads, self-refineries, and help.
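To make the two notations concrete, here is a small sketch that tallies a list of (source, URL) links into domain-class counts of the form {X Y} and site sizes of the form X [Y]. It is illustrative only: the input shape, the function names, and the choice to treat each distinct host name as a "site" are assumptions, and the URLs are made up.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit

DOMAIN_CLASSES = ("com", "edu", "gov", "org", "net", "mil")


def domain_class(url):
    """Top-level domain class; international and numeric hosts become "other"."""
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return tld if tld in DOMAIN_CLASSES else "other"


def domain_counts(links):
    """Return {class: (X, Y)}: X distinct URLs in the class, Y links to them."""
    link_totals = Counter()
    urls_by_class = defaultdict(set)
    for _source, url in links:
        cls = domain_class(url)
        link_totals[cls] += 1
        urls_by_class[cls].add(url)
    return {cls: (len(urls), link_totals[cls])
            for cls, urls in urls_by_class.items()}


def site_sizes(links):
    """Return {X: Y}: Y sites that contribute X distinct URLs each.

    A "site" is assumed here to be a distinct host name.
    """
    urls_by_site = defaultdict(set)
    for _source, url in links:
        urls_by_site[urlsplit(url).hostname or ""].add(url)
    return dict(Counter(len(urls) for urls in urls_by_site.values()))


if __name__ == "__main__":
    # Hypothetical links: (source page or engine, URL found there).
    links = [
        ("engineA", "http://www.example.org/a"),
        ("engineB", "http://www.example.org/a"),
        ("engineA", "http://www.example.org/b"),
        ("engineA", "http://host.example.fi/x"),
    ]
    for cls, (x, y) in sorted(domain_counts(links).items()):
        print(f"{cls} {{ {x} {y} }}")
    for x, y in sorted(site_sizes(links).items()):
        print(f"{x} [{y}]")
```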
The third and fourth columns are derived analyses classifying the URLs by number of links, e.g. hit by more than one search engine or other web pages (sometimes multiple links from the same page), and by the most populous sites in the URL pool. In the case of the COCOMO topic, where every URL was viewed and rated using the one-minute classification regime, the distribution of URLs across the search engines used is also shown.
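The derived analyses can be sketched in the same style: group URLs by how many links point at them, and count URLs and links per search engine. This is a hedged illustration, not the twURL computation; in particular the counting rule (one link per (source, URL) pair in the input) is an assumption, so it need not reproduce the table's totals exactly.

```python
from collections import Counter, defaultdict


def link_distribution(links):
    """Return {N: (X, Y)}: X URLs that each have N links, Y links in total."""
    links_per_url = Counter(url for _source, url in links)
    dist = defaultdict(lambda: [0, 0])
    for _url, n in links_per_url.items():
        dist[n][0] += 1   # one more URL with exactly n links
        dist[n][1] += n   # its n links add to the class total
    return {n: tuple(pair) for n, pair in dist.items()}


def engine_distribution(links):
    """Return {engine: (X, Y)}: X distinct URLs and Y links from that engine."""
    link_totals = Counter(source for source, _url in links)
    urls_by_engine = defaultdict(set)
    for source, url in links:
        urls_by_engine[source].add(url)
    return {engine: (len(urls), link_totals[engine])
            for engine, urls in urls_by_engine.items()}


if __name__ == "__main__":
    # Hypothetical links: (search engine, URL it returned).
    links = [
        ("engineA", "http://www.example.org/a"),
        ("engineB", "http://www.example.org/a"),
        ("engineA", "http://www.example.org/b"),
        ("engineA", "http://host.example.fi/x"),
    ]
    print(link_distribution(links))
    print(engine_distribution(links))
```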
More examples of these topics and discussions of some of the phenomena are included on our website [www.roir.com]. To highlight just a bit of what we learned on each topic,
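| Topic, source, and processing | Domain class { URLs links } | Links per URL { URLs links } | URLs per site [number of sites] |
| Usability Engineering Technical, Search engines, Unfiltered (only ads) |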
com { 108 295 }
edu { 67 181 }
gov { 11 40 }
org { 40 102 }
net { 13 33 }
mil { 4 9 }
other { 163 415 }
|
5 { 2 12 }
4 { 27 134 }
3 { 78 298 }
2 { 290 614 }
1 { 8 9 }
|
23 [1] org ^acm 21 [1] other ^uk ^ac 13 [2] com ^useit other ^fi ^hut 10 [1] gov ^nist 9 [6] |
| Therac 25 Software safety case, Search engines, Filtered, annotated, analysed |
com { 25 54 }
edu { 80 186 }
gov { 4 13 }
org { 15 37 }
net { 2 3 }
mil { 1 1 }
other { 56 126 }
|
1 { 71 71 }
2 { 51 102 }
3 { 29 87 }
4 { 14 56 }
5 { 9 45 }
6 { 2 12 }
7 { 2 14 }
8 { 3 24 }
9 { 1 9 }
|
1 [90] 2 [10] 3 [8] 4 [1] 5 [1] 16 [1] 24 [1] |
| Year 2000 Date "Crisis" Vendor lists (August 1996), Filtered programming research |
com { 274 553 }
edu { 4 4 }
gov { 2 2 }
org { 2 2 }
net { 3 4 }
mil { 2 2 }
other { 22 31 }
|
5 { 3 17 }
4 { 60 289 }
3 { 13 39 }
2 { 1 2 }
1 { 232 251 }
|
93 [1] 19 [1] 17 [1] 13 [1] 11 [3] 10 [2] 9 [2] 8 [1] 7 [4] 6 [1] 2 [5] 1 [44] |
| Evita (the movie) (really Madonna) Entertainment, Search engines, Lightly filtered |
com { 68 84 }
edu { 17 21 }
gov { 0 0 }
org { 0 0 }
net { 6 8 }
mil { 0 0 }
other { 34 39 }
|
2 { 19 38 }
3 { 1 3 }
4 { 2 8 }
|
1 [41] 2 [6] 3 [6] 4 [3] 5 [2] 6 [1] 7 [2] 12 [1] |
| COCOMO Software estimation tool, Search engines, Fully filtered and rated (COCOMO product ) |
com { 38 48 }
edu { 128 184 }
gov { 23 29 }
org { 8 10 }
net { 4 4 }
mil { 9 12 }
other { 96 117 }
|
infoseek { 59 107 }
metacrawler { 84 141 }
webseeker { 192 274 }
owwwl { 50 65 }
1 { 222 222 }
2 { 72 144 }
3 { 10 30 }
4 { 2 8 }
|
1 [65] 2 [23] 3 [9] 4 [4] 5 [5] 6 [2] 7 [1] 9 [2] 10 [2] 23 [1] 47 [1] |
| Software researchers' website Virtual Library, newsletter, pubs, Website download, Not filtered |
com { 16 81 }
edu { 163 360 }
gov { 23 105 }
org { 3 14 }
net { 0 0 }
mil { 2 6 }
other { 72 294 }
|
6 { 3 24 }
4 { 5 32 }
3 { 116 593 }
2 { 56 112 }
1 { 99 99 }
|
125 [1] 22 [1] 19 [1] 13 [1] 6 [1] 5 [2] 4 [4] 3 [2] 2 [7] 1 [48] |
| FMEA Safety technique, Search engines, Unfiltered, (scatter-gather experiment) |
com { 372 618 }
edu { 174 266 }
gov { 69 100 }
org { 61 78 }
net { 29 41 }
mil { 11 15 }
other { 488 718 }
|
33 { 1 33 }
24 { 1 24 }
6 { 1 6 }
5 { 4 21 }
4 { 18 76 }
3 { 67 207 }
2 { 231 488 }
1 { 881 981 }
|
71 [1] 47 [1] 34 [1] 33 [1] 32 [1] 25 [1] 20 [1] 19 [1] 18 [1] 17 [1] 12 [3] 11 [3] 10 [2] 8 [4] 7 [4] 6 [9] 5 [18] 4 [22] 3 [34] 2 [74] 1 [257] |