Following are the highlights of some of the papers we reviewed related to BiC. We attempted to cross many fields and sample both methodology and results. We also introduced some new "speed-reading" techniques in the process, by downloading online conference proceedings, indexing the results, and reading the papers inside-out.
"Computational Models of Information Scent-Following in a Very Large Browsable Text Collection" (Pirolli, CHI 97) provides a useful methodology framework - knowledge, adaptation, cognitive, and biological. Our work is partly Knowledge-oriented in trying to understand how tasks are formulated and to define a common set of tasks to evaluate our approach and tools. Adaptation also characterizes the need for understanding and supporting strategies within constraints of environment, tools, and costs. "Browsing in context" is definitely a strategy, to process one by one ordered items so as to (a) avoid frequent context-switching, thus lowering overload and increasing efficiency and (b) produce more consistent and reproducible evaluations.
The model-tracer experimental methodology is applicable to the twURL toolset, where events, user focus, and actions could be logged to develop a model of types and order of major operations. For example, browsing a domain view may start with a domain the user is familiar with or perhaps one most likely to be productive or perhaps based on size or other factors. Some set of probes of the data and some type of decision process could be investigated to determine how domain preference works.
Scatter-gather, the primary clustering mechanism used by the Xerox research team, is an example of a strategy for helping the user find useful clusters within large numbers of documents in pursuit of other goals. In one sense, our Domain and Link Count Views provide definite clusters with fixed results within a given pool. Scatter-gather uses text analysis and document features to determine similarity, providing a flexible way of shifting user attention from one term or feature to another.
"Sensemaker: An Information-Exploration Interface Supporting the Contextual Evolution of a User's Interests" (Baldenodado and Winograde, CHI 97) observes how users progress through information contexts while collecting resources on a topic. Sensemaker provides explicit bundling, duplicate detection, collection expansion, and viewing operations, one of which is URL-bundling, as in our Domain View. Their experiments suggest that indeed users find different view, as in tables of unbundled and different bundlings, useful but also show how difficult it is to design such experiments (what to measure, time to perform) and get useful results. Defined and variable level expansion-reduction operations would be useful for the BiC paradigm.
"Life, Death, and Lawfulness on the Electronic Frontier" (Pitkow and Perilli, CHI 97) bases an investigation on co-citation analysis, that 2 cited articles in the same paper have something in common worth noting. Not surprisingly, they found that the clustering of a university website closely followed its mutli-project orientation with a smattering of individual and common environment clusters. The paper also addresses the important issue of life-cycle of web pages, providing formulae which predict changes and survival based on usage patterns. We have found "links-to" more useful than "links-from" in our analyses. Since we simply extract all links from a web page, it's quite common to mix self-reference and advertising in with the primary results being listed. This is especially true of search engine web results but is also a pattern for more cohesive pages as well. Links-to answers the question: how did this URL get in the pool?
"Conceptual Analysis of the Web" was a workshop of 12 diverse individuals addressing new ways of working with the web. Much of the discussion centered around phenomena - what was happening on the web? how was it different? how could such phenomena be studied? Particularly interesting were the work of
"The Great CHI 97 Browse-Off" was an effort to compare alternative tools for finding specific items within an extensive and diverse hierarchy of subjects, without using text search. The Xerox Perspective View provided a novel way of visually organizing and manipulating the hierarchy and was most effective during the contest. Our toolset, aside from its difficulty with the bulk of the data (over 7 MB), was not particularly well matched to the tasks identified by the organizers. We;re still not sure what the import of this type of browsing is nor how improved tools for it will get anybody's work done faster or better.
Here we report a different way of analyzing the literature of a field, namely by browsing its concept/index.
This is an online record of actions taken on these Proceedings, which were downloaded several months ago and then processed into the ConceptBrowser. The goal was to find papers relevant to the subject of this research project report, i.e. relevant to the BiC model. More generally, this was an experiment in using concept-based, bottom-up browsing of the entire proceedings as opposed to top-down table-of-contents selection of papers to read. All the work was performed using the offline browser and links from concepts to papers. A time limit of about 2 hours was imposed to see how many papers and narrower topics would be found. This experiment was also motivated by not having the index of the proceedings immediately available.
Starting from top:
[Well, one-by-one through the A's takes about 30 minutes, including reading 3 papers] Time to take another approach to find the concepts most related, probably under "navigation" or "hyper" but first I'll do a frequency list and see what concepts are most common to this collection of papers. In a way my goal is to "slice" the proceedings, finding the subcollection of relevance.
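The frequency list and the "slice" can be sketched as follows. This is only an illustration: it assumes the concept index is available as a mapping from concept to the papers it appears in, and the names `concept_frequencies`, `slice_papers`, and the sample entries are hypothetical rather than part of the ConceptBrowser.

```python
def concept_frequencies(index):
    """Return (concept, paper_count) pairs, most widely used concepts first."""
    counts = {concept: len(set(papers)) for concept, papers in index.items()}
    # Sort by descending count, then alphabetically to make ties deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

def slice_papers(index, concepts):
    """Collect the papers linked to any of the chosen concepts."""
    selected = set()
    for concept in concepts:
        selected.update(index.get(concept, ()))
    return selected

# Hypothetical concept index: concept -> papers in which it was found.
index = {
    "navigation": ["paper1", "paper2", "paper3"],
    "hypertext": ["paper2", "paper4"],
    "agents": ["paper5"],
}
ranked = concept_frequencies(index)
subcollection = slice_papers(index, ["navigation", "hypertext"])
print(ranked[0], sorted(subcollection))
```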
Another idea: first identify the personal names in the concept list, which usually appear as "Last, First" or "Dr. X" or "I.M. Somebody". Our ConceptBrowser adds this marking to the Iconovex processing for a reasonably accurate list of names. Using names,
Going back to the Table of Contents, I found a highly relevant paper that hadn't been indexed (a constant problem of our web process is managing file sets where the download or component functions may fail).
This sample of papers suggests that (1) information driven approaches to web use are still in their infancy with little data to identify better or worse approaches, (2) more work is related to servers, networks, and multimedia than information content. We found several formalizations of part of our work and ideas for several extensions.
We followed a similar process with several other conferences: Digital Library 95, Hypertext 96, and the Allerton 97 Digital Publishing workshop.
Hypertext 97 contains a number of papers that address models of documents and document collections. We were unable to access these documents until late in the project, but it is clear that new ways of expressing models of navigation are being developed by the hypertext research community. A formal model of "Browsing in Context" could be developed; indeed, the set model underlying LodeStar informally follows that model. There would be several benefits to such a model: (1) better design of the database, user interface, and editing operations; (2) better explanations in documentation or other tutorials; and (3) a base for more algorithmic analyses.
Using the query "information foraging" on Northern Light Search (this didn't turn up in other search engines), we found an interesting master's thesis from the University of Toronto:
Abrams, David (1997). Human Factors of Personal Web Information Spaces. Knowledge Media Design Institute Technical Report #1, University of Toronto.
The empirical subject is bookmarks and how people use them, enumerating many problems from a survey. However, the thesis is an encyclopedia of web uses, including many useful categorizations of web practice and empirical data. Highly recommended.
Automatic hyperlinking is a new field combining information retrieval and document structuring. For example, "Does Navigation Require More Than One Compass" by Daniela Rus and James Allen shows ways that semantic links may be drawn within a corpus of documents. In a sense, that's what our concept browser does - all documents containing a specific term are linked together. But the authors' proposed approach drives more toward ways of permanently linking documents with more semantic content than simple use of terms. One of the surprising keys to making this approach work is ascribing more meaning to document parts, e.g. labelling theorems as "theorem", a practice common to older word processors such as Scribe and LaTeX. The emerging technology XML, derived from SGML, is likely to provide many of these opportunities to put structure labelling back into documents, keeping structure separate from appearance.
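The simple term-based linking attributed to our concept browser above can be sketched as an inverted index. This is a deliberately naive illustration, not the concept browser's actual processing: terms are assumed to be lowercase alphabetic words, and the names `build_term_index`, `linked_documents`, and the sample documents are hypothetical.

```python
import re
from collections import defaultdict

def build_term_index(documents):
    """Map each term to the set of documents that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for term in set(re.findall(r"[a-z]+", text.lower())):
            index[term].add(name)
    return index

def linked_documents(index, term):
    """Return the documents linked together by sharing the given term."""
    return sorted(index.get(term.lower(), set()))

# Hypothetical three-document collection.
documents = {
    "d1": "Theorem about navigation",
    "d2": "Navigation with a compass",
    "d3": "A compass rose",
}
index = build_term_index(documents)
print(linked_documents(index, "navigation"))
print(linked_documents(index, "compass"))
```

Links of this kind carry no more meaning than "shares a word", which is the limitation that structure labelling (for example, marking a passage as a theorem) is meant to address.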
"Category Translation on the Web: Learning to Understand Information on the Internet" (Perkowitz, Etzioni) discusses probabilistic means of relating documents, particularly suitable for very large databases or highly structured data such as email addresses.
The IJCAI-95 Workshop on Context in Natural Language Processing lays out a set of questions addressing the nebulous concept of context, e.g. how do you tell whether one context is the same as another? As natural language processing evolves, it may be possible to identify significant contexts, such as industry (automotive vs. medical), depth of document (advertising vs. technical report), or scope of authorship (personal bookmarks vs. group-authored white paper vs. prize-winning paper).
An extended collection of papers on "Usability on the Web" (International Journal of Human-Computer Studies) discusses several experiments. An example of a serious empirical study is Tauscher, L. and Greenberg, S., "How people revisit web pages: empirical findings and implications for the design of history systems". They point out the great confusion among browser users about what history is, which URLs are being stacked, and what forward and backward mean. Since alternative history mechanisms are not likely to replace the fossilized features of modern browsers, it's not clear how the results might be used except in add-ons or other browser utilities. The existing packages HindSite and Zooworks package all pages passing through a browser and then index the pages for retrieval by URL or phrase. Other packages accelerate browsers by pre-loading pages, permitting more flexible browsing within clusters of pages. We observe, however, that the simplest mechanism of high utility is simply to keep an index of all pages passing into a browser cache as a separate file. A former version of Microsoft Internet Explorer did this, making our LodeStar tools especially useful for processing these histories. Again, web utilities exist that index browsers' caches. Our concern is that studies like this are focused on features that are already known to be deficient, that what is learned is hard to apply within the current 2-browser marketplace, and that generally we seem to be moving away from the big picture (such as the information foraging model) into interface details and naive human behavior.
The book Secrets of the Super Net Searchers by Reva Basch contains interviews with numerous Internet figures and experts in the practice of using online databases such as Dialog and Knight Ridder. In general, those librarians and information brokers with many years of experience have higher expectations of search engines because of their past experience with precise, tuned, documented query interfaces (usually text). The allusion to "Gray Literature" (publicity, products, people information) is frequent, both as a contrast with traditional information types (meeting bibliographic standards) and as a distinctive feature of the Internet. Studies of how these experienced professionals use these commercially tuned databases have also shown that only about 20% of the available information on the Internet is actually accessed during queries. While there will always be time limits (and for these professionals also $100/hr online charges), there do not appear to be the packaging tools for collections that we are discussing here. One further concern of these 20 information professionals is that quality is especially a problem on the Internet, contrasting with the commercial databases, which were more likely to have certified and regular sources of information. Reading these interviews brings an awareness of the near randomness of the web in contrast with current physical libraries and commercially packaged databases.
To summarize, there is considerable literature on navigational models, on information space characteristics, and on user behavior. We found no model exactly like ours but rather that our approach overlaps many. We believe that the domain-link-concepts combination is distinctive because it draws out clues from different types of sources. In addition, LodeStar presents these views in a limited but familiar and consistent interface with myriad operations. Particularly beneficial additions would be: more graphical visualizations, additional text analyses (e.g. document typing and structure extraction), and a clearer browsing model combining our views with traditional one-URL-at-a-time browsing.