Appendix A:
CHI 97 Workshop Paper

Browsing In Context

Susan L. Gerhart, ROI Joint Venture slger@netropolis.net

Position Paper for CHI97 Workshop on Augmented Conceptual Analysis of the Web, March 1997, Atlanta GA.

Perspective

Let's say I'm a "web information professional" who regularly needs to collect extensive amounts of technical material on particular web topics for specific client needs, catalogs "on spec", or personal growth. Search engines and pathfinder pages will typically bring in several hundred URLs -- an eclectic mess that varies widely in content, relative and absolute quality, and accessibility. What techniques do I employ? What tools do I assemble? What conceptual framework do I use to trade off quantity and quality for specific projects and for the evolution of my intellectual property base? Also, I need to turn my search results and polished reports around quickly, both to meet deadlines and to be cost-effective.

Phenomena Observed

Following are several gross empirical claims based on experience with topics such as those in the attached table of data:

  1. Typically, 10-20% of URLs from just about any source (search engines, processed lists, websites) are erroneously constructed, dead, transiently unavailable, etc.
    ==> Large scale topic collection requires specific screening for dead links.
    ==> Internet congestion and unpredictability require replay and incremental strategies.
  2. Running the same query through multiple search engines, even down to depths of 100, rarely produces many overlapping URLs.
    ==> Multiple searchers and meta searchers must be plied and their results then collated. Hits from multiple search engines increase the likelihood of relevance (because of different retrieval methods), but appearance in only one does not correlate highly with relevance.
  3. Many topics tend to be concentrated in specific Internet domains (and sites), but most are also spread across almost all domains (even .mil). Likewise, most topics have contributions ranging from individuals through organized entities (with supported websites) through multi-institutional associations. Domain and organizational unit are only two of several dimensions of web contexts.
    ==> Browsing by context can reduce confusion and reveal sometimes surprising patterns of content.
  4. Rule of thumb: Half the URLs discovered will be keepers; half will be redundant, dull, off-topic, or of very little content.
    ==> Rapid decision making and filtering must be supported, with minimal overhead in locating, downloading, and managing storage of web materials.
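The collation called for under claim 2 can be illustrated with a minimal Python sketch. The engine names, the example URLs, and the helper functions normalize, collate, and rank_by_agreement are all hypothetical; this is not how twURL does it, only one plausible way to count how many engines agree on each URL.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalize(url):
    """Lowercase scheme and host and drop any fragment so trivial variants collate."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    query = "?" + parts.query if parts.query else ""
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}{query}"


def collate(results_by_engine):
    """Map each normalized URL to the set of engines that returned it."""
    engines_for_url = defaultdict(set)
    for engine, urls in results_by_engine.items():
        for url in urls:
            engines_for_url[normalize(url)].add(engine)
    return dict(engines_for_url)


def rank_by_agreement(engines_for_url):
    """Order URLs so that those returned by more engines come first."""
    return sorted(engines_for_url, key=lambda u: (-len(engines_for_url[u]), u))


# Hypothetical result lists; engine names and URLs are placeholders.
results = {
    "engine_a": ["http://Example.edu/cocomo.html", "http://example.com/tools"],
    "engine_b": ["http://example.edu/cocomo.html#intro", "http://example.org/"],
}
collated = collate(results)
ranked = rank_by_agreement(collated)
```

Ranking by agreement only puts multi-engine hits first for review; as noted above, a URL found by a single engine still has to be looked at.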

Discussion

Several months of web foraging and increasingly disciplined topic organization have revealed the pros and cons of different ways of working the web. On-line surfing is needed and useful only to access initial topic pointers (typically search engines, clearinghouses, and personal knowledge) and then later to follow up leads. One-URL-at-a-time click-and-wait is not only mind-numbing but also a constant temptation to stray from the goal of the topic collection. Downloaders such as the Windows 95 products Surfbot, NetAttache Pro, WebWhacker/WebSeeker, EchoSearch, and our own twURL can collect the initial content for offline browsing (depending, of course, on an open representation of the downloaded "data bases").

But that's only the start. Recording impressions, classifying URLs, and generating reports are the new tasks that absorb time. List editors, bookmark managers, and other browser accessories are required to manage these tasks. Just as the "office suites" of 5 years ago formed from word processing, spell checking, drawing tools, and macro languages, new suites are forming to master these time-consuming tasks. Our experience building twURL is that (1) providing a wide array of recording mechanisms can significantly increase the user's ability to concentrate on reading and assessing content and (2) an embedded, programmable browser, even one moderately impaired with respect to modern multimedia and HTML, provides a significant jump in capability. A recent PC Magazine "Abort-Retry-Fail" cartoon shows Ross Perot on Larry King Live saying "enough about the election, let's talk browsers. The two big ones aren't reflecting the needs of users. It's time for a 'reform browser'." Well, maybe the cartoon Perot was right about one thing - the browser warriors might be missing the essence of the problem from a content-analyzer's perspective.

Assuming the mechanical and logistic problems were (being) solved, how would this topic collector and organizer and analyst get down to the heart of the problem? Our experience is that a good set of user-defined multi-dimensional classification schemes tailored to the specific analysis goal and topic can provide the focus and decision-making. For example, most URLs, i.e. their pages, will fit into the following "genre" classifications:

Organizational unit:
individual, project, division, institution, association
Utility:
Method, tool, process, evaluation, theory
Stakeholder:
User, vendor, consultant, educator/student, researcher, ...
Perspective:
Fact, issue/controversy, exploration, exposition, ...
Strength (relative to goals):
great, good, OK, weak
Disposal:
dead, dull, dumb, different
Process:
requirements, design, implementation, validation, documentation, education, maintenance, ...
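One way to picture such a multi-dimensional scheme is as a small record kept per URL. The sketch below is an assumption for illustration only: the UrlRating class, its field names, and the is_keeper rule are hypothetical and do not describe twURL's internal format. The open-ended dimensions (those listed above with "...") are kept as free strings, while the two closed lists become enumerations.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Strength(Enum):
    GREAT = "great"
    GOOD = "good"
    OK = "OK"
    WEAK = "weak"


class Disposal(Enum):
    DEAD = "dead"
    DULL = "dull"
    DUMB = "dumb"
    DIFFERENT = "different"


@dataclass
class UrlRating:
    """One analyst's classification of a single URL (hypothetical layout)."""
    url: str
    organizational_unit: Optional[str] = None  # e.g. "individual", "project"
    utility: Optional[str] = None              # e.g. "method", "tool"
    stakeholder: Optional[str] = None          # e.g. "vendor", "researcher"
    perspective: Optional[str] = None          # e.g. "fact", "exposition"
    process: Optional[str] = None              # e.g. "design", "validation"
    strength: Optional[Strength] = None
    disposal: Optional[Disposal] = None
    notes: str = ""

    def is_keeper(self):
        # Assumed rule: keep anything not disposed of and rated OK or better.
        keep = (Strength.GREAT, Strength.GOOD, Strength.OK)
        return self.disposal is None and self.strength in keep


rated = UrlRating("http://example.edu/report.html",
                  organizational_unit="project", utility="method",
                  strength=Strength.GOOD)
discarded = UrlRating("http://example.com/gone.html", disposal=Disposal.DEAD)
```

Leaving every dimension optional reflects the one-minute assessment: the analyst records only what is obvious at a glance and fills in the rest later, if ever.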

Now, assuming that (1) downloading, browsing, recording, and reporting are small constant parts of the processing for each URL and that (2) classification schemes such as the above provide frameworks for quality discrimination, as well as useful tailoring for end users of the topic collections, what is the most significant part of the decision-making operation and the main determiner of how long it takes? Consider the task of making a one-minute assessment of a page sitting in a browser. The decision may be a first cut requiring little reading (e.g. a directory listing or an "under construction" page) or it may require some reading, scrolling, and thinking. If there is no sequence to the URL progression (perhaps 500 or so to be processed), then the overhead of changing contexts is enormous for most of the diverse web collections - one minute a grad student's bookmark list, the next a major research luminary's technical report, then a tools vendor, a conference, etc. The more ways the material can be clustered to provide continuity to the decision-maker's thinking, i.e. knowing that this URL fits into some previously established region of the web, if only .com vs. .edu vs. .mil, or carries the assurance of multiple search engine hits, the more context switching is avoided. Not only can the analyst remain sane while working at stretches of an hour or so, but the resulting classification may be more consistent and reproducible.
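A minimal sketch of one such clustering follows, assuming only that the host name is a usable proxy for context. Reversing the host labels makes URLs sort by top-level domain first and by site second, so neighbours in the review queue tend to come from the same region of the web. The function names and example URLs are hypothetical.

```python
from itertools import groupby
from urllib.parse import urlsplit


def context_key(url):
    """Reverse the host labels so URLs sort by top-level domain, then site."""
    host = (urlsplit(url).hostname or "").lower()
    return tuple(reversed(host.split(".")))


def review_queue(urls):
    """Return (host, urls) groups ordered so that neighbours share context."""
    ordered = sorted(urls, key=lambda u: (context_key(u), u))
    return [(".".join(reversed(key)), list(group))
            for key, group in groupby(ordered, key=context_key)]


# Placeholder URLs in the mixed order a search might return them.
urls = [
    "http://www.example.com/tools",
    "http://cs.example.edu/~student/bookmarks.html",
    "http://www.example.com/pricing",
    "http://cs.example.edu/reports/tr-1.html",
]
queue = review_queue(urls)
```

Other dimensions, such as the number of search engines that returned a URL, could be added to the sort key in the same way.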

Thus, our take on one of the major needs of web users is for better "browsing in context", not just to perform the above challenging analysis tasks but also for regular browsing. Somewhere along the line, browsers (the big ones needing 'reform') turned into motion picture theaters rather than hypertext navigators. Nothing wrong with the former, but many users (a few million or so) most likely need better tools for exploiting the WWW as a vast technical library and personal growth environment.

Trends and Needs

Several technology and infrastructure improvements are needed:

  1. Some way of marking and disposing of once-useful, maybe still useful, but no longer timely content, e.g. past newsletters, conference announcements, course offerings, etc. Not just relocatable URLs, but more graceful conventions for archiving and tracking past events, e.g. differences among course offerings, or follow-up publications from conferences.
  2. Automatic categorization or self-classification of organizational units. Knowing whether you're going to some sheep doctor's home page or the Human Genome project may matter (or maybe not). The eclectic nature of the web provides both the delight of discovery and learning and frequent unnecessary overhead.
  3. Usability engineering of the above-mentioned browser accessories (or complements or energizers) to help design the suites that in the next few years will give modern researchers tools to match the incomparable expansion of knowledge access through the WWW.

Data

Following is a table of results from several topic collections using our twURL tool - a kind of Swiss Army knife with downloader, outline organizer, editor, report generator, browser, rating recorder, etc. The data shows some of the scale of (still experimental) topic collections, e.g. a filtered 250 up to unfiltered 1250.

The first column describes the topic, how it was collected, and degrees of processing (ongoing). The second column shows the associated domain classification, where "Other" is all international, numeric, and often erroneous URLs. Numbers of the form {X Y} indicate X URLs in that class with Y links from the source of the URLs (saved or cached search engine pages, downloaded web pages, etc.). Site sizes are shown as X [Y], where Y is the number of sites with X URLs. Note that these collections typically start out with loads of "search engine junk", including ads, self-refineries, and help.

The 3rd and 4th columns are derived analyses classifying the URLs by # of links, e.g. hit by more than one search engine or other web pages (sometimes multiple links from the same page) and most populous sites in the URL pool. In the case of the COCOMO topic, where every URL was viewed and rated using the one-minute classification regime, also shown is the distribution of URLs across search engines used.
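To make the notation concrete, the following Python sketch tallies the three kinds of summary from a list of (URL, link count) records. It is an illustration under stated assumptions, not twURL's code: a "site" is taken to be a distinct host name, "other" is any host whose last label is not one of the six listed domains, and the records are made up.

```python
from collections import Counter
from urllib.parse import urlsplit

DOMAINS = ("com", "edu", "gov", "org", "net", "mil")


def host_of(url):
    return (urlsplit(url).hostname or "").lower()


def domain_class(url):
    """Return one of the six domains, or "other" (international, numeric, bad)."""
    tld = host_of(url).rsplit(".", 1)[-1]
    return tld if tld in DOMAINS else "other"


def tally(records):
    """records: iterable of (url, links). Returns the three summaries."""
    by_domain = {}  # domain class -> [URLs, links], i.e. {X Y}
    by_links = {}   # links per URL -> [URLs, links], i.e. {X Y}
    urls_per_site = Counter()
    for url, links in records:
        for table, key in ((by_domain, domain_class(url)), (by_links, links)):
            row = table.setdefault(key, [0, 0])
            row[0] += 1
            row[1] += links
        urls_per_site[host_of(url)] += 1
    # Site sizes: number of URLs at a site -> number of sites of that size.
    site_sizes = dict(Counter(urls_per_site.values()))
    return by_domain, by_links, site_sizes


# Made-up records: (URL, number of links pointing at it from the sources).
records = [
    ("http://www.example.edu/a", 2),
    ("http://www.example.edu/b", 1),
    ("http://example.com/", 1),
    ("http://192.0.2.7/x", 3),
]
by_domain, by_links, site_sizes = tally(records)
```

In the table's notation this toy collection would read edu {2 3}, com {1 1}, other {1 3} for domains, and 2 [1], 1 [2] for site sizes.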

More examples of these topics and discussions of some of the phenomena are included on our website [www.roir.com]. To highlight just a bit of what we learned on each topic,

twURL data from several searches

Usability Engineering
  Technical, Search engines, Unfiltered (only ads)
  Domains {URLs links}: com {108 295}, edu {67 181}, gov {11 40}, org {40 102}, net {13 33}, mil {4 9}, other {163 415}
  Links per URL {URLs links}: 5 {2 12}, 4 {27 134}, 3 {78 298}, 2 {290 614}, 1 {8 9}
  Site sizes, URLs [sites]: 23 [1] org ^acm; 21 [1] other ^uk ^ac; 13 [2] com ^useit, other ^fi ^hut; 10 [1] gov ^nist; 9 [6]
Therac 25
  Software safety case, Search engines, Filtered, annotated, analysed
  Domains {URLs links}: com {25 54}, edu {80 186}, gov {4 13}, org {15 37}, net {2 3}, mil {1 1}, other {56 126}
  Links per URL {URLs links}: 1 {71 71}, 2 {51 102}, 3 {29 87}, 4 {14 56}, 5 {9 45}, 6 {2 12}, 7 {2 14}, 8 {3 24}, 9 {1 9}
  Site sizes, URLs [sites]: 1 [90], 2 [10], 3 [8], 4 [1], 5 [1], 16 [1], 24 [1]
Year 2000 Date "Crisis"
  Vendor lists (August 1996), Filtered programming research
  Domains {URLs links}: com {274 553}, edu {4 4}, gov {2 2}, org {2 2}, net {3 4}, mil {2 2}, other {22 31}
  Links per URL {URLs links}: 5 {3 17}, 4 {60 289}, 3 {13 39}, 2 {1 2}, 1 {232 251}
  Site sizes, URLs [sites]: 93 [1], 19 [1], 17 [1], 13 [1], 11 [3], 10 [2], 9 [2], 8 [1], 7 [4], 6 [1], 2 [5], 1 [44]
Evita (the movie) (really Madonna)
  Entertainment, Search engines, Lightly filtered
  Domains {URLs links}: com {68 84}, edu {17 21}, gov {0 0}, org {0 0}, net {6 8}, mil {0 0}, other {34 39}
  Links per URL {URLs links}: 2 {19 38}, 3 {1 3}, 4 {2 8}
  Site sizes, URLs [sites]: 1 [41], 2 [6], 3 [6], 4 [3], 5 [2], 6 [1], 7 [2], 12 [1]
COCOMO
  Software estimation tool, Search engines, Fully filtered and rated (COCOMO product)
  Domains {URLs links}: com {38 48}, edu {128 184}, gov {23 29}, org {8 10}, net {4 4}, mil {9 12}, other {96 117}
  Search engines {URLs links}: infoseek {59 107}, metacrawler {84 141}, webseeker {192 274}, owwwl {50 65}
  Links per URL {URLs links}: 1 {222 222}, 2 {72 144}, 3 {10 30}, 4 {2 8}
  Site sizes, URLs [sites]: 1 [65], 2 [23], 3 [9], 4 [4], 5 [5], 6 [2], 7 [1], 9 [2], 10 [2], 23 [1], 47 [1]
Software researchers' website
  Virtual Library, newsletter, pubs, Website download, Not filtered
  Domains {URLs links}: com {16 81}, edu {163 360}, gov {23 105}, org {3 14}, net {0 0}, mil {2 6}, other {72 294}
  Links per URL {URLs links}: 6 {3 24}, 4 {5 32}, 3 {116 593}, 2 {56 112}, 1 {99 99}
  Site sizes, URLs [sites]: 125 [1], 22 [1], 19 [1], 13 [1], 6 [1], 5 [2], 4 [4], 3 [2], 2 [7], 1 [48]
FMEA
  Safety technique, Search engines, Unfiltered (scatter-gather experiment)
  Domains {URLs links}: com {372 618}, edu {174 266}, gov {69 100}, org {61 78}, net {29 41}, mil {11 15}, other {488 718}
  Links per URL {URLs links}: 33 {1 33}, 24 {1 24}, 6 {1 6}, 5 {4 21}, 4 {18 76}, 3 {67 207}, 2 {231 488}, 1 {881 981}
  Site sizes, URLs [sites]: 71 [1], 47 [1], 34 [1], 33 [1], 32 [1], 25 [1], 20 [1], 19 [1], 18 [1], 17 [1], 12 [3], 11 [3], 10 [2], 8 [4], 7 [4], 6 [9], 5 [18], 4 [22], 3 [34], 2 [74], 1 [257]