For more information, see the ROI Website (http://www.roir.com).
We have implemented and integrated a set of rudimentary algorithms that profile collections of Web materials to build a preliminary picture of what's available, possibly answering the search goal immediately without browsing. Technically, the links (<A HREF="URL">link text</A>) are extracted from a collection of HTML files; the URLs are then sorted, merged, and classified by WWW domain (.com, .edu, etc.). The premise here is that the texts of the links, read within a WWW domain setting, provide sufficient clues for a person familiar with the subject to make a preliminary judgment of the value of the referenced page. These clues, together with the "hyperlink skeleton" as outlined (in a GUI treeview), provide a kind of canonical representation that (1) can be subjected to other analyses and presentations and (2) within a GUI environment, can be marked up, edited, and traversed by other parts of the tool suite.
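The extraction and classification step can be sketched as follows. This is a minimal illustration in present-day Python, not the Visual Basic code used in twURL; the names LinkExtractor and profile are hypothetical, and the nesting of the result (domain, then URL, then the set of link texts) is an assumption about what the "hyperlink skeleton" might look like.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkExtractor(HTMLParser):
    """Collects (url, link text) pairs from <A HREF="..."> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # HREF of the anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())  # collapse whitespace
            self.links.append((self._href, text))
            self._href = None


def profile(pages):
    """Merge links from many HTML strings into domain -> URL -> link texts."""
    outline = defaultdict(lambda: defaultdict(set))
    for html in pages:
        parser = LinkExtractor()
        parser.feed(html)
        parser.close()
        for url, text in parser.links:
            host = urlsplit(url).hostname
            if not host:
                continue  # relative link or non-network scheme: no domain
            domain = "." + host.rsplit(".", 1)[-1]
            outline[domain][url].add(text)
    return outline
```

Iterating over sorted(outline) and then over sorted(outline[domain]) gives the sorted, merged listing; duplicate URLs from different files collapse into one entry that keeps every link text seen for it.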
The HTML sources may be acquired from previous searches (explicitly saved or retrieved from the browser's cache) or from one of the automatic downloaders discussed above. The canonical outline answers such needs as: show all the URLs from commercial, military, or non-US sources but not from universities or professional associations. Each URL can be looked at in relation to other URLs from the same host and domain as well as to the specific sources in the HTML collection.
The outline structures in our GUI tool are editable, so that the above operations yield outlines (trees) that can be trimmed, marked up, counted, and rearranged into various classifications, e.g. "# links>=2" AND "in .com domain" AND "not a search engine company". These edited profiles can then serve as plans for automated browsing sessions. Furthermore, as URLs are visited, the GUI supports the attachment of what we call 'webits': notes, ratings or relevance classifications, and marks for further editing.
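A compound classification such as the one quoted above can be pictured as a conjunction of simple tests on each outline node. The sketch below is only an illustration: the Node record, the helper names, and the SEARCH_ENGINE_HOSTS exclusion list are all hypothetical, and twURL itself performs these operations through its GUI rather than through code like this.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Node:
    url: str
    link_count: int  # how many links in the collection point at this URL


# Hypothetical exclusion list; a real one would be maintained by the user.
SEARCH_ENGINE_HOSTS = {"www.hotbot.com", "www.metacrawler.com"}


def at_least(n):
    """Test for '# links >= n'."""
    def test(node):
        return node.link_count >= n
    return test


def in_domain(suffix):
    """Test for 'in <suffix> domain', e.g. suffix='.com'."""
    def test(node):
        host = urlsplit(node.url).hostname or ""
        return host.endswith(suffix)
    return test


def not_search_engine(node):
    """Test for 'not a search engine company'."""
    return urlsplit(node.url).hostname not in SEARCH_ENGINE_HOSTS


def all_of(*tests):
    """AND the given tests together into one test."""
    def test(node):
        return all(t(node) for t in tests)
    return test


query = all_of(at_least(2), in_domain(".com"), not_search_engine)
```

Applying query to every node of an outline, for example [n for n in nodes if query(n)], yields the marked subset that can then be trimmed or regrouped.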
A more complete description of this GUI, twURL (tm), is available at http://www.roir.com. Briefly, it runs on Windows 95 and is written in Visual Basic, using a browser component from Catalyst Development and the Protoview Data Explorer interface component. twURL operates within a "virtual suite" of tools that assist in downloading HTML materials, monitoring changes, and multi-searching. Specifically, we use EchoSearch (http://www.iconovex.com), a multi-searcher that indexes concepts and names of downloaded materials; WebFerret (http://www.ferretsoft.com), which collates results in depth and also provides (with 20MB of ZIP Drive space) for indexed text; Surfbot (http://www.surflogic.com), a general purpose and open downloader, multi-searcher, site mapper, and URL monitor; and NetAttache Pro (http://www.tympani.com), another general purpose, but not open, subscription and brief organizer. We use all of the search engines extensively and deeply (well beyond the first 10 hits), but we find the greatest precision and depth in Ultra (http://ultra.infoseek.com), the greatest number of hits in Hotbot (http://www.hotbot.com), and breadth and classification in Northern Light (http://www.nlsearch.com). Online meta-searchers such as MetaCrawler (http://www.metacrawler.com) are also useful, although not as extensive or accountable as our desktop multi-searchers.
An extensive range of editing and viewing functions is available: marking by string content, number of links, or data attributes (e.g. errors, whether a local file is available, marking within another outline); classifying URLs with similar properties (string content, title, etc.); trimming marked outlines and nodes; and copying, pasting, and dragging outline parts. HTML reports are generated corresponding to outlines, with menus and interlinks at higher outline levels. Saved versions of the databases are maintained.
Although there are many "offline browsers" and site-downloaders on the market (see the Consummate Winsock Apps list), we found it necessary to build our own simpler downloader. Other tools hold their pages in proprietary formats or require more setup, while providing more management tools. Our downloader, Tracker, simply walks a tree, downloading the files and building an index file while recording errors.
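The general shape of such a tree-walking downloader can be sketched as below. This is not Tracker's actual code: the function names, the breadth-first order, the depth limit, the rule that only URLs beginning with the root are followed, and the in-memory index are all assumptions made for illustration. The fetch function is a parameter so that the walk can be tried without a network connection.

```python
import re
from collections import deque
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen

# Deliberately simple HREF pattern; a full HTML parser would be more robust.
HREF = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def http_fetch(url):
    """Default fetcher: download one page over HTTP and return it as text."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def walk(root, fetch=http_fetch, max_depth=2):
    """Walk the tree under root; return (pages, index, errors).

    pages  maps each downloaded URL to its HTML text
    index  lists (depth, url) in download order
    errors lists (url, message) for every failed download
    """
    pages, index, errors = {}, [], []
    queue = deque([(root, 0)])
    seen = {root}
    while queue:
        url, depth = queue.popleft()
        try:
            html = fetch(url)
        except Exception as exc:  # record the failure and keep walking
            errors.append((url, str(exc)))
            continue
        pages[url] = html
        index.append((depth, url))
        if depth == max_depth:
            continue  # do not follow links below the depth limit
        for href in HREF.findall(html):
            child = urldefrag(urljoin(url, href)).url  # absolute, no #fragment
            if child.startswith(root) and child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return pages, index, errors
```

Writing the index list out as a small HTML or text file alongside the saved pages would give the index file mentioned above; the errors list corresponds to the recorded failures.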
Recently added to this suite is an interface to text analysis tools from Iconovex Corporation (http://www.iconovex.com). Their SWAPI (Syntactica Web Application Programmer Interface) engine uses grammatical patterns to identify phrases corresponding to persons, places, product names, and concepts. A metasearcher, EchoSearch, queries multiple search engines, downloads files, qualifies results for refined search terms, and generates HTML reports. Unfortunately, for topics of any size, the HTML reports reach 5MB or more, which most browsers cannot load and reload fast enough for comfortable use. Our SwapiBrowser transforms SWAPI output into GUI outlines, using the same embedded browser as twURL, and permits browsing the concept lists, seeing the contexts of concepts, and linking to those contexts in the files. Additional functions permit some analysis (separation of persons' names, counts of contexts, files related by concepts, and concepts related by files).
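The last two analyses, files related by concepts and concepts related by files, amount to a pair of inverted indexes. The sketch below illustrates that idea only; it assumes the concept occurrences have already been reduced to (file, concept) pairs, which is a simplification and not the actual SWAPI output format, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def relate(occurrences):
    """Build both indexes from an iterable of (file, concept) pairs."""
    files_by_concept = defaultdict(set)
    concepts_by_file = defaultdict(set)
    for file, concept in occurrences:
        files_by_concept[concept].add(file)
        concepts_by_file[file].add(concept)
    return files_by_concept, concepts_by_file


def related_pairs(index):
    """Map each pair of keys to the values they share, omitting empty overlaps.

    Given concepts_by_file it relates files by shared concepts;
    given files_by_concept it relates concepts by shared files.
    """
    shared = {}
    for a, b in combinations(sorted(index), 2):
        common = index[a] & index[b]
        if common:
            shared[(a, b)] = common
    return shared
```

Because the same function works on either index, one pass over the occurrence pairs is enough to answer both questions.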
twURL's underlying formal model is set-based: many of the marking and editing operations are viewed as special-purpose queries over sets of URLs and links.
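Read this way, marking selects a subset and trimming takes a set difference. The following lines are a small sketch of that reading, assuming links are held as (source file, target URL) pairs; the function names are hypothetical and are not part of twURL.

```python
def marked(urls, predicate):
    """Marking: the subset of URLs that satisfies a predicate."""
    return {u for u in urls if predicate(u)}


def trim(outline, marks):
    """Trimming: the outline's URLs with the marked ones removed."""
    return outline - marks


def linked_from(links, sources):
    """URLs referenced by any of the given source files."""
    return {target for source, target in links if source in sources}
```

For example, linked_from(links, {"a.html"}) & linked_from(links, {"b.html"}) gives the URLs that two saved pages have in common, using nothing beyond ordinary set intersection.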
As LodeStar has been developing, we've found ourselves increasingly in need of a process, not just tools, especially because we are using several interfacing tools and frequently encounter failures in one or more of them. Here's the outline of the process we currently use: