Browsing In Context -- an Information Foraging Model

BiC is a foraging strategy. In this section, we clarify its applicability (topic collections) and identify some of the "laws" of information space we've observed.

As an example of the representations used during BiC, Appendix D provides an annotated outline for the .org domain "information foraging".

Information Foraging Actions - What do Web Users Do?

Consider the following types of Web user tasks:

We use the term "browsing" to encompass all the aspects of the above activities that include going from one item to the next to perform needed actions. For example, browsing for updating means going from item to item (possibly skipping many) to find those items with changes requiring either notes of change or replacement with changed items.

The emphasis of "browsing in context" is that the browser (person) is either

  1. building a mental model of the collection and its attributes using different views and analyses, or
  2. taking action on items within some known "context", such as filtering, rating, annotating, displaying, etc.

Of course, there is always a "context" of some sort: a search engine's results list, for example, is a product of its index, query, and relevance mechanism, possibly also influenced by the load on its system, advertising pressure, etc. In our model, by contrast, the browser (person) operates in contexts they have imposed upon the material and can therefore control more fully.

Contexts in our terms are basically orderings that provide:

The browser (person) processing large collections of materials gains from contexts the ability to focus on defined features and then process similar items together, with similarity being well-defined and kept in mind. Thus, the browser (tool) should support the browser (person)'s need for context.

Information Foraging Propositions

The objective of information foraging (see sample references in Appendix D) is to discover regularities in information intensive activities by looking upon the problem in ecological terms - a system of materials supporting a predator-prey interplay of activities. The raw materials are information items distributed, created, modified, duplicated, etc. across networks. The predator-prey relationship characterizes the browser (person) seeking bits of good materials in rich or diverse patches, i.e. foraging.

Our model is basically a two-phase approach. First, the browser identifies possibly desirable items, then acquires them for offline processing. The second phase is where we place our emphasis: how the "offline" activities, i.e. those performed on already collected and localized materials, may be supported to better define collections to maintain and use. Perhaps this is like collecting seeds, or genes, with the right parentage, features, and possibilities, while trying to understand the total pool and each item's relationship to the whole, also encouraging wild and singular items as members that broaden the scope and overall evolution of the total pool.

The Information Foraging approach is as much a research methodology as it is a metaphor. The methodology includes attempts to mathematically define the regularities as laws that offer predictability and explanation. Experiments are then performed to better understand the effects of the regularities on systems and users, with the intent to improve system performance, develop better tools, and provide training to users.

We have also identified a few regularities (amplified on in empirical baselines), although these are far from adequately measured or statistically characterized.

suggests that part of each phase above (acquisition, evaluation) must be to identify and resolve dead URLs. In fact, while many URLs are indeed dead, many are simply mutated (a reorganized server file system) or moved. In many cases, it is easy to resolve a dead URL by visiting the site and either finding the changed URL or looking it up in an archive. The significance of this action is in its cost: 100 dead URLs in a collection of 1,000, at 10 minutes per URL, is 1,000 minutes, or nearly 17 hours of work. Recognizing the existence of this law helps estimate the cost of acquiring a collection and valuating its items (what if the missing URLs are from a key site?).
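The cost estimate above is simple arithmetic, and a small sketch makes it easy to rerun with other figures. The function names and the 8-hour working day are assumptions introduced for illustration; only the inputs (1,000 URLs, 10% dead, 10 minutes each) come from the text.

```python
def dead_url_cost_hours(collection_size, dead_fraction, minutes_per_url):
    """Estimate the hours needed to resolve the dead URLs in a collection."""
    dead_urls = collection_size * dead_fraction
    return dead_urls * minutes_per_url / 60.0


def hours_to_working_days(hours, hours_per_day=8.0):
    """Convert hours to working days; the 8-hour day is an assumption."""
    return hours / hours_per_day


# Figures from the text: 1,000 URLs, 10% dead, 10 minutes per dead URL.
hours = dead_url_cost_hours(1000, 0.10, 10)
print(f"{hours:.1f} hours, {hours_to_working_days(hours):.1f} working days")
```

Varying `dead_fraction` or `minutes_per_url` shows how quickly the resolution cost grows with collection size.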

is both a fact of life and a result of the rapid commercial transition of information retrieval technology into widespread use without the benefit of library-style metadata. While some search systems are extended from technology regularly evaluated for precision and recall, most of the major commercial search engines have evolved from a mixture of technology concerns and commercial opportunities. There is no particular reason a search engine company should meet higher than necessary standards when its revenue is derived from advertising and auxiliary information sources. That metadata (library-style classification, even mundane details such as authorship and date) is lacking comes from the rapid growth of the Web, particularly in the realm of "gray literature" (products/services literature, individual publications). The burden then imposed upon serious web users is to (1) acquire the skills and expectations to manage search engines effectively and (2) find and use tools that assist in the resolution of off-topic URLs.

There may well be tradeoffs in the process of acquiring and qualifying collections of URLs, e.g. using a metasearcher may dig deeper into search engines far more rapidly than individually entered queries, but at the cost of more off-topic URLs because it's not "one query fits all". While we do not have an accurate characterization of the range for either of these laws, we can consider the consequences of either broadening the range or extending the boundaries. If only 1% of URLs in a collection were dead, we might still have to look at all 100% to determine that 1% but we would probably move the aliveness test somewhere later in the process. If 50% of all URLs were dead, we probably would be looking for better search engines or giving up our goals. The estimates we have made suggest that the problem is in the manageable, but costly range.

is more subjective but again suggestive of actions. Considering that quality is both a matter of objectives for the collection and subjective to the evaluator, the main benefit of this estimate is its suggestion that better evaluation tools are required.

This is the primary hypothesis of the experiment we designed: that a more managed process of collection and evaluation using the "browsing in context" approach will produce better results for a variety of types of standard tasks. For example, we've identified the frequent occurrence of collections with perhaps 5% of the URLs being job postings or resumes, particularly where an industrial practice or technology is involved. Since the existence and quantity of such URLs may tell us something about the vitality and currency of the subject, these URLs help characterize a topic collection but not in any particularly usable or enduring way. Thus, a job-filtering function of an evaluation system would be helpful, if not required.
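As an illustration of what such a job-filtering function might look like, here is a minimal keyword-based sketch. The keyword list, the function name, and the (url, text) input format are all assumptions made for this example, not a description of any existing tool. Items are partitioned rather than deleted, so the count of job-like URLs remains available as a signal about the topic.

```python
import re

# Hypothetical keyword list; a real filter would be tuned per collection.
JOB_PATTERN = re.compile(
    r"\b(resume|curriculum vitae|job posting|job opening|careers?|"
    r"employment|now hiring|apply now)\b",
    re.IGNORECASE,
)


def split_job_urls(items):
    """Partition (url, text) pairs into (job_like_urls, other_urls)."""
    job_like, other = [], []
    for url, text in items:
        # Search both the URL and the page text for job-related keywords.
        target = job_like if JOB_PATTERN.search(f"{url} {text}") else other
        target.append(url)
    return job_like, other


items = [
    ("http://example.org/jobs/123", "Now hiring a search engineer"),
    ("http://example.org/paper.html", "A study of information foraging"),
]
job_like, other = split_job_urls(items)
print(len(job_like), "job-like;", len(other), "other")
```

A keyword match of this kind is crude: "resume" also matches the verb, so flagged items still deserve a quick look before being set aside.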

Finally,

derives from the behavior of the web population and suggests ample opportunity for organizational improvement. Consider the academic population, where pages are posted for self-reference with, increasingly, pointers to directly related work available online. But that's still a small proportion of the total links that could be drawn among web pages. A real web of a subject would have hundreds of links between any two pages, e.g. in common terms used. More evolved argumentation structures such as IBIS (issue-based information systems) might draw out more significant relationships among documents, e.g. issue-position-argument or refines-generalizes or theory-practice or implements-evaluates.
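The idea of implicit links through common terms can be made concrete with a small sketch that lists the terms two pages share. The function name, the tokenization, and the minimum term length are assumptions chosen for illustration; each shared term is one potential link of the kind described above.

```python
import re


def shared_terms(text_a, text_b, min_length=4):
    """Return the sorted terms (at least min_length letters) used by both texts."""
    def terms(text):
        words = re.findall(r"[a-z]+", text.lower())
        return {w for w in words if len(w) >= min_length}

    return sorted(terms(text_a) & terms(text_b))


page_a = "Information foraging on the Web"
page_b = "Foraging theory and information scent"
print(shared_terms(page_a, page_b))
```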

Thus we see a range of questions that our "browsing in context" approach and tools can address, although imprecisely and, currently, informally. Of most interest to us in the future are, at the lower level, trade-offs and, at the higher level, the stronger intellectual relationships that derive from hypertext and argumentation.

Information Foraging Technology

We discuss this further in the literature review, but a comparison is worth discussing here, namely alternative ways of using text analysis to assist foraging.

We have not yet experimented with text-based clustering but rather have found useful an explicit and directly manipulated representation of terms from documents. While clearly limited in scale, we are finding, using the SWAPI technology, some 30,000+ concepts in a 1000 URL collection. Our Concept Browser permits us to scroll up and down on a compressed list of concepts to find ones of interest, and we can also order concepts by frequency. A concept can then be "browsed" across its occurrences within the documents by selecting a summary or a few surrounding words of interest and then navigating to that point in the documents. This strategy is more one of bottom-up, small clusters, in contrast to the algorithmic feature vector clustering of scatter-gather. An experiment might well be defined here, e.g. to determine how large a concept list can be managed with this type of direct representation and manipulation, how users identify concepts of interest, how they process items once found to be of high or low quality based upon a term's usage, and how this approach could be improved with thesaurus, glossary, and other language aids.
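The concept list described above (ordered by frequency, with each occurrence browsable in its surrounding words) can be approximated with a plain inverted index. The sketch below is not SWAPI or the Concept Browser: it assumes single-word tokens as stand-ins for concepts, and every function name is hypothetical.

```python
import re
from collections import defaultdict


def build_concept_index(docs):
    """Map each term to its occurrences as (doc_id, word_position) pairs."""
    index = defaultdict(list)
    tokens_by_doc = {}
    for doc_id, text in docs.items():
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        tokens_by_doc[doc_id] = tokens
        for pos, token in enumerate(tokens):
            index[token].append((doc_id, pos))
    return index, tokens_by_doc


def concepts_by_frequency(index):
    """Return (term, count) pairs, most frequent first, ties alphabetical."""
    pairs = ((term, len(occs)) for term, occs in index.items())
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))


def occurrences_in_context(term, index, tokens_by_doc, window=3):
    """Yield (doc_id, snippet) with a few surrounding words per occurrence."""
    for doc_id, pos in index.get(term, []):
        tokens = tokens_by_doc[doc_id]
        start = max(0, pos - window)
        yield doc_id, " ".join(tokens[start:pos + window + 1])


docs = {
    "a.html": "The query language supports boolean query terms",
    "b.html": "Foraging without a query",
}
index, tokens_by_doc = build_concept_index(docs)
print(concepts_by_frequency(index)[:3])
for doc_id, snippet in occurrences_in_context("query", index, tokens_by_doc, window=1):
    print(doc_id, "|", snippet)
```

Keeping word positions in the index is what allows navigation from a concept to the exact point in a document, at the price of an index that grows with the total text size.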

An anecdote may illustrate how this approach has worked well. On a due-diligence consulting assignment, the online documents from a research center were downloaded and indexed.

  1. The consultants needed a quick answer to the following question: "what kind of query mechanism does the evaluated technology use?" A foray online to the center and use of its online system would have taken several minutes, while the downloaded and indexed collection was immediately available. It happened that the index for 'query' (quite a lot of terms) led immediately to the page where the query language was described. It is not clear that a clustering approach would have led as quickly to the answer as the complete displayed index of terms, and it is also clear that pre-processing and prior acquisition of materials was a large factor.
  2. At another point, the investment firm for whom the technology was being assessed needed a list of the participants of the research center. "Who might have licensed the technology or had additional inside information?" "How much interest had the technology already attracted from industry as potential licensors, partners, or users?" It happened that these pages were scattered around the website, but easily identified using LodeStar's VCR-like navigation of web pages (feeding URLs into a browser), its marking operation (selecting ones to display), and directory (copying all the interesting files into a subdirectory). Within a few minutes of browsing the Domain outline, the materials were on the fax to the customer.

The main issue here is cost: of downloading and storage (2 hours, 20 MB), of user time (this instance was fully automated, with the user entering the URL, then copying the file to portable disk drives), and of pre-processing (again automatic but consuming some time and planning). Other values include variable use of the materials (unanticipated queries), learning (seeing the kind of content and names of participants by browsing the concept list), scoping (seeing that the center had multiple, interrelated research projects on medical informatics), and slicing (finding all the pages that related to the center's industrial affiliate program).

We believe that foraging is only part of the overall set of tasks for web users, which also includes rather mundane bookkeeping, file management, report generation, waste reclamation, etc.