BiC is a foraging strategy. In this section, we clarify its applicability (topic collections) and identify some of the "laws" of information space we have observed.
As an example of the representations used during BiC, Appendix D provides an annotated outline for the .org domain "information foraging".
Consider the following types of Web user tasks:
Just Browsing - looking through possibly complex material without clear goals
Searching - looking for some fact, particular type of material, or specific file or location
Collecting - identifying materials on a particular topic and saving pointers and/or files
Qualifying - assuring oneself or another that a particular item meets criteria such as topic inclusion, quality standards, or due diligence requirements
Scoping - defining the boundaries of a topic, especially with respect to a given collection
Rating - applying a set of criteria to a given item, placing it in defined categories
Dividing - splitting a collection according to some qualifying, scoping, rating, or collecting requirement
Merging - bringing together and identifying common elements in different collections
Slicing - focusing items of a collection on a particular aspect of the collection's subject
Reporting - providing descriptions of content and overall collection structure
Updating - identifying materials that have changed and how so, new materials on a topic, and taking action on changes
We use the term "browsing" to encompass all the aspects of the above activities that include going from one item to the next to perform needed actions. For example, browsing for updating means going from item to item (possibly skipping many) to find those items with changes requiring either notes of change or replacement with changed items.
The emphasis of "browsing in context" is that the browser (person) is either
Of course, there is always a "context" of some sort, e.g. a search engine's results list is a product of its index, query, and relevance mechanism, possibly also influenced by the load on its system, advertising pressure, etc. In our model, by contrast, the browser (person) operates in contexts they have imposed upon the material and therefore have better, or further, control over.
Contexts in our terms are basically orderings that provide:
continuity - one item has something definite in common with or different from its predecessors or successors
extremes - there's a place to start and a place to end and progress through items can be measured
visualizations - ways of showing grouping
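The three properties above can be illustrated with a small sketch. It assumes, purely for illustration, that each item is a dictionary with hypothetical `url` and `site` fields and that the imposed context is "order by site"; none of these names come from the BiC tools themselves.

```python
from itertools import groupby


def build_context(items, feature):
    """Impose an ordering: items sharing a feature value become neighbours (continuity)."""
    return sorted(items, key=feature)


def progress(ordered, position):
    """Fraction of the ordering processed so far; defined start and end make it measurable."""
    return (position + 1) / len(ordered)


def group_runs(ordered, feature):
    """Adjacent items with the same feature value, a basis for showing grouping."""
    return [(value, list(run)) for value, run in groupby(ordered, key=feature)]


def site_of(item):
    # Hypothetical feature: the site an item was collected from.
    return item["site"]


items = [
    {"url": "http://b.example/2", "site": "b.example"},
    {"url": "http://a.example/1", "site": "a.example"},
    {"url": "http://b.example/1", "site": "b.example"},
]

ordered = build_context(items, site_of)
for site, run in group_runs(ordered, site_of):
    print(site, len(run))
print(f"progress after first item: {progress(ordered, 0):.0%}")
```

Any other feature (date, rating, a term of interest) could be substituted for `site_of` to impose a different context on the same collection.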
The browser (person) processing large collections of materials gains from contexts the ability to focus on defined features and then process similar items together, with similarity being well-defined and kept in mind. Thus, the browser (tool) should support the browser (person)'s need for context.
The objective of information foraging (see sample references in Appendix D) is to discover regularities in information-intensive activities by looking upon the problem in ecological terms: a system of materials supporting a predator-prey interplay of activities. The raw materials are information items distributed, created, modified, duplicated, etc. across networks. The predator-prey relationship characterizes the browser (person) seeking bits of good materials in rich or diverse patches, i.e. foraging.
Our model is basically a two-phase approach. First, the browser identifies possibly desirable items, then acquires them for off-line processing. The second phase is where we place our emphasis: how the "offline" activities, i.e. those on already collected and localized materials, may be supported to better define the collections to maintain and use. Perhaps this is like collecting seeds, or genes, with the right parentage, features, and possibilities, while trying to understand the total pool and each item's relationship to the whole, also encouraging wild and singular items as members that broaden the scope and overall evolution of the total pool.
The Information Foraging approach is as much a research methodology as it is a metaphor. The methodology includes attempts to mathematically define the regularities as laws that offer predictability and explanation. Experiments are then performed to better understand the effects of the regularities on systems and users, with the intent to improve system performance, develop better tools, and provide training to users.
We have also identified a few regularities (amplified in the empirical baselines), although these are far from adequately measured and are not yet statistically characterized.
suggests that a part of each phase above (acquisition, evaluation) must be to identify dead URLs and decide how to act on them. In fact, while many URLs are indeed dead, many have simply mutated (a reorganized server file system) or moved. In many cases, it is easy to resolve a dead URL by visiting the site and either finding the changed URL or looking it up in an archive. The significance of this action is in its cost: 100 dead URLs in a collection of 1000, at 10 minutes per URL, come to about 1000 minutes (roughly 17 hours), i.e. more than a full day's work. Recognizing the existence of this law helps estimate the cost of acquiring a collection and valuating its items (what if the missing URLs are from a key site?).
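The cost estimate can be written out as a short calculation. This is only a sketch: the function name and parameters are illustrative, and the figures are the ones given in the text (100 dead URLs, 10 minutes each).

```python
def resolution_cost_hours(dead_urls, minutes_per_url):
    """Estimated hours needed to resolve a given number of dead URLs."""
    return dead_urls * minutes_per_url / 60.0


# Figures from the text: 100 dead URLs out of 1000, 10 minutes per URL.
hours = resolution_cost_hours(dead_urls=100, minutes_per_url=10)
print(f"{hours:.1f} hours")
```

Varying `dead_urls` or `minutes_per_url` gives a quick sense of how sensitive the cost of acquiring a collection is to the dead-URL rate.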
is both a fact of life and a result of the rapid commercial transition of information retrieval technology into widespread use without the benefit of library-style metadata. While some search systems are extended from technology regularly evaluated for precision and recall, most of the major commercial search engines have evolved from a mixture of technology concerns and commercial opportunities. There is no particular reason a search engine company should meet higher-than-necessary standards when its revenue is derived from advertising and auxiliary information sources. That metadata (library-style classification, even mundane details such as authorship and date) is lacking comes from the rapid growth of the Web, particularly in the realm of "gray literature" (products/services literature, individual publications). The burden then imposed upon serious web users is (1) to acquire the skills and expectations needed to manage search engines effectively and (2) to find and use tools that assist in the resolution of off-topic URLs.
There may well be tradeoffs in the process of acquiring and qualifying collections of URLs, e.g. using a metasearcher may dig deeper into search engines far more rapidly than individually entered queries, but at the cost of more off-topic URLs because it's not "one query fits all". While we do not have an accurate characterization of the range for either of these laws, we can consider the consequences of either broadening the range or extending the boundaries. If only 1% of URLs in a collection were dead, we might still have to look at all 100% to find that 1%, but we would probably move the aliveness test somewhere later in the process. If 50% of all URLs were dead, we probably would be looking for better search engines or giving up our goals. The estimates we have made suggest that the problem is in the manageable but costly range.
is more subjective but again suggestive of actions. Considering that quality is both a matter of objectives for the collection and subjective to the evaluator, the main benefit of this estimate is its suggestion that better evaluation tools are required.
This is the primary hypothesis of the experiment we designed, that a more managed process of collection and evaluation using the "browsing in context" approach will produce better results for a variety of types of standard tasks. For example, we've identified the frequent occurrence of collections with perhaps 5% of the URLs being job postings or resumes, particularly where an industrial practice or technology is involved. Since the existence and quantity of such URLs may tell us something about the vitality and currency of the subject, these URLs help characterize a topic collection but not in any particularly usable or enduring way. Thus, a job-filtering function of an evaluation system would be helpful, if not required.
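A job-filtering function of the kind suggested here might look like the following minimal sketch. The keyword list, the function names, and the `(url, title)` item shape are all assumptions made for illustration. A real evaluation system would need a better test than substring matching, which will also flag unrelated words that happen to contain a hint.

```python
# Hypothetical hints; a real filter would tune these against the collection.
JOB_HINTS = ("resume", "curriculum vitae", "job posting", "job opening",
             "careers", "employment")


def looks_like_job_page(url, title=""):
    """Crude heuristic: does the URL or title contain a job-related hint?"""
    text = f"{url} {title}".lower()
    return any(hint in text for hint in JOB_HINTS)


def split_job_pages(items):
    """Divide (url, title) pairs into job-related items and everything else."""
    jobs, others = [], []
    for url, title in items:
        (jobs if looks_like_job_page(url, title) else others).append((url, title))
    return jobs, others


collection = [
    ("http://example.org/careers/opening1.html", "Engineer wanted"),
    ("http://example.org/papers/foraging.html", "Information foraging"),
]
jobs, others = split_job_pages(collection)
print(len(jobs), "job-related;", len(others), "other")
```

Keeping the job-related items in a separate list, rather than discarding them, preserves their count as a rough indicator of the subject's vitality.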
Finally,
derives from the behavior of the web population and suggests ample opportunity for organizational improvement. Consider the academic population, where pages are posted for self-reference with, increasingly, pointers to directly related work available online. But that is still a small proportion of the total links that could be drawn among web pages. A real web of a subject would have hundreds of links between any two pages, e.g. through the terms they use in common. More evolved argumentation structures such as IBIS (issue-based information systems) might draw out more significant relationships among documents, e.g. issue-position-argument or refines-generalizes or theory-practice or implements-evaluates.
Thus we see a range of questions that our "browsing in context" approach and tools can address, although imprecisely and, currently, informally. Of most interest to us in the future are, at the lower level, trade-offs and, at the higher level, stronger intellectual relationships that derive from hypertext and argumentation.
We discuss this further in the literature review, but a comparison is worth discussing here, namely alternative ways of using text analysis to assist foraging.
We have not yet experimented with text-based clustering but rather have found useful an explicit and directly manipulated representation of terms from documents. While clearly limited in scale, we are finding, using the SWAPI technology, some 30,000+ concepts in a 1000 URL collection. Our Concept Browser permits us to scroll up and down a compressed list of concepts to find ones of interest, and we can also order concepts by frequency. A concept can then be "browsed" across its occurrences within the documents by selecting a summary or a few surrounding words of interest and then navigating to that point in the documents. This strategy is more one of bottom-up, small clusters, in contrast to the algorithmic feature-vector clustering of scatter-gather. An experiment might well be defined here, e.g. to determine how large a concept list can be managed with this type of direct representation and manipulation, how users identify concepts of interest, how they process items once found to be of high or low quality based upon a term's usage, and how this approach could be improved with thesaurus, glossary, and other language aids.
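The idea of a directly manipulated concept list can be sketched in a few lines. This is not the SWAPI technology or the Concept Browser: it assumes, for illustration only, that a "concept" is a single lower-cased word and that documents are plain strings keyed by a hypothetical document id.

```python
import re
from collections import defaultdict


def build_concept_index(documents):
    """Map each term to the (doc_id, token_position) pairs where it occurs."""
    index = defaultdict(list)
    tokens_by_doc = {}
    for doc_id, text in documents.items():
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        tokens_by_doc[doc_id] = tokens
        for pos, token in enumerate(tokens):
            index[token].append((doc_id, pos))
    return index, tokens_by_doc


def concepts_by_frequency(index):
    """Terms ordered from most to least frequent, ties broken alphabetically."""
    return sorted(index, key=lambda term: (-len(index[term]), term))


def occurrences_in_context(term, index, tokens_by_doc, window=3):
    """Each occurrence of a term with a few surrounding words, per document."""
    results = []
    for doc_id, pos in index.get(term, []):
        tokens = tokens_by_doc[doc_id]
        start = max(0, pos - window)
        results.append((doc_id, " ".join(tokens[start:pos + window + 1])))
    return results


documents = {
    "a": "Information foraging on the web",
    "b": "Foraging theory and web foraging",
}
index, tokens_by_doc = build_concept_index(documents)
print(concepts_by_frequency(index)[:3])
print(occurrences_in_context("theory", index, tokens_by_doc, window=1))
```

The frequency-ordered list corresponds to the scrolling concept list, and `occurrences_in_context` corresponds to browsing a concept across its occurrences. How far such a flat list scales is exactly the open experimental question raised above.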
An anecdote may illustrate how this approach has worked well. On a due-diligence consulting assignment, the online documents from a research center were downloaded and indexed.
The main issue here is cost: of downloading and storage (2 hours, 20 MB), of user time (this instance was fully automated, with the user entering the URL, then copying the file to portable disk drives), and of pre-processing (again automatic but consuming some time and planning). Other values include variable use of the materials (unanticipated queries), learning (seeing the kind of content and names of participants by browsing the concept list), scoping (seeing that the center had multiple, interrelated research projects on medical informatics), and slicing (finding all the pages that related to the center's industrial affiliate program).
We believe that foraging is only one part of the overall work of web users, which also includes rather mundane bookkeeping, file management, report generation, waste reclamation, etc.