Do Search Engines Suppress Controversy?

A June 2003 U.S. FCC ruling on media ownership consolidation raised the question: Does the Internet provide effective alternatives to TV, radio, and news channels? Are websites cost-effective outlets for expression of diverse opinions and fair dissemination of information? Some researchers (see references) suggest the Web is much less than a distinct alternative to traditional media, and even presents similar barriers and limits. Search engines -- the central force for finding information on the Web -- have inherent biases, as does any technology.  Perhaps we can test one distinctive aspect of this question: do search engines suppress controversy?
The Web is a Puzzle (sometimes)
Suppose you query a search engine on a topic of new interest or to meet a current need. Perhaps the herbal remedy "St John's Wort" intrigues you, or you're planning a trip to the Central American country of "Belize", maybe considering getting another degree via "Distance Learning", helping your child write a report on "female astronauts", or reviewing the life and times of "Albert Einstein". What if each topic came back in the top search results without any apparent controversy: no disagreements about effectiveness of the herbal remedy, nothing but enticing beach hotels and ecotourism in Belize, a roster of good-looking sites promising learning at your own pace, a space history emphasizing first Russian and American women astronauts, or a plethora of quotation-filled biographies that briefly allude to Einstein's pre-WWII life in Europe. Each topic appears as if the search engine had a distinctly "sunny personality", telling you nothing suspicious or unpleasant about any of these five topics.

Would you have received a fair representation of the web content on each topic? You'd certainly find well-written, informative pages and web sites to surf from. But would the search engines have suppressed underlying controversies, facts and disputes that might alter your medical, educational, or travel plans? Might you miss some alternative views of the space program or the life of Einstein that could be instructive for you or your children? Or could you be lulled into thinking a topic rather boring because you miss the richer indicators of worldly activity, such as multiple viewpoints, changing historical perspectives, and reflections of social norms? If underlying controversies exist, how would you find them?

Where's the Controversy? Experiments on 5 popular topics

We investigated a specific controversy for each of the above five topics, then measured how much each controversy appeared in the results of the general queries ("Belize", "distance learning", "Albert Einstein", "St. John's Wort", "female astronauts"). The results were mixed: two controversies showed and three were virtually missing when search results were compiled from three popular engines (Google, Teoma, and AllTheWeb) and two meta-searchers (Profusion and Copernic, querying and collating from different engines). Most enlightening were the factors that suppressed or revealed the controversies.

Controversy Topic: Belize Guatemala border

1. "Belize" looks to search engines like mostly hotels and ecotourism, reflecting the primary industry of the country: ruins, reefs, and jungles. Also online are country fact sheets, historical chronologies, local newspapers, and several Belizean government web sites. The underlying controversy we sought was the Belize-Guatemala border dispute -- claims from the early 19th century that Guatemala owned Belize -- now being mediated by international organizations. The dispute is rich in the history of colonialism (Belize gained independence in 1981), a major saga for a small developing country, and an occasional tragedy for settlers. Ask for "Belize Guatemala border dispute" and you'll get many descriptive and passionate historical and political pages. But the top search results for "Belize" will mention the dispute only as small items in country fact sheets.

Controversy Topic: distance learning leading to decay of academia

2. "Distance Learning" search results are a web of suppliers and trade associations with an occasional page of links over to the analytic literature researching quality of learning. In fact, around 1998, technology historian David Noble coined the term "digital diploma mills" to raise red flags about commercialization of the academic enterprise by outside interests (and university administrations), subverting the control of faculty over their own content and their relationships with students. Thought-provoking articles reproduced online stimulated wide debate in socio-technical newsletters and fora. But this controversy will be found on only one library-based page of links, leaving the reader to perhaps assume academia not only accepts distance learning as equal to traditional learning but has fully embraced its commercial potential. Could someone searching for "distance learning" be misled into a commercial maze of choices rather than cautioned about their options?

Controversy Topic: Mileva Maric, first Mrs. Einstein

3. Surely, everything must be known about "Albert Einstein" and his early 20th century life, as presented in a multitude of biographies on science and reference sites. Well, not quite: emerging suggestions indicate there's more to his first marriage to Serbian science student Mileva Maric. Her characterization in his biography is "they met in physics classes in Zurich, became lovers and colleagues, gave up an illegitimate daughter, married against family wishes, settled down (with two more sons) for Albert to pursue his career, and Maric just didn't make it through her graduate qualifiers. After a while, Albert strayed from his unhappy wife and married again, emigrating to the U.S.". However, letters and investigations by scholars of early women scientists suggest Maric contributed rather more than has been acknowledged to Einstein's work leading up to his Nobel prize (the prize money became her divorce settlement). Will a search on "Albert Einstein" yield this "first wives club" tale of woe? No, but a search on "Einstein Maric" or "Mileva Maric" will.

We've now seen 3 controversies where considerable web material showed up in queries specific to the controversy ("Belize Guatemala border dispute", "David Noble" and "digital diploma mills", and "Mileva Maric Einstein") but not in the simple query on the broader topic ("Belize", "distance learning", "Albert Einstein"). What's suppressing these controversies? The main reason is organizational clout: the usual services and clearing houses of tourism, ecology, and archeology in Belize; large schools with distance learning services and associations promoting distance learning; and reference and science sites for Einstein biographies and quotations. The sites containing controversy have more analytic pages (data, detail, less glossy), are more from individuals and smaller organizations, and are less inter-linked and coordinated. There are no largish central web sites for Belize-Guatemala disputes, David Noble, or Mileva Maric.

Controversy Topic: Jerrie Cobb's astronaut ambitions

4. However, search for "female astronauts" and up will come pages about the "Mercury 13". This group of women aviators passed preliminary physical and psychological tests but was rejected by NASA in 1961. Why? The women weren't test pilots (a role restricted to male military), astronauts at the time were uncomfortable with changes in the "way things are", and loss of a female astronaut might jeopardize the whole space program. However, the tale of these brave and skilled pilots, lives already risked as World War II WASPs, their careers disrupted for this secret training, and their aspirations for contributing to the US space program has been told in two recent books, "Mercury 13" by M. Ackmann and "Promised the Moon" by S. Nolen. The 20th and 40th anniversaries of American and Russian women astronaut firsts and the 1998 geriatric flight by original Mercury astronaut John Glenn combined with take-up of the cause by many women's organizations, current astronauts (Commander Eileen Collins), and the still adventurous surviving pilots to keep the story visible. The controversial subtopic appeared in roughly 15% of search results, enlivening a rather stale list of "first women" exploits and a saddened manned space program. Although organizational clout (NASA) dominates the primary query "female astronauts", this controversy is well orchestrated with web sites, petitions, news columns, and press releases.

Controversy Topic: St. John's Wort and depression

5. "St. John's Wort" exhibits recent controversy over effectiveness of the herb for its claimed mood enhancement and depression reduction. With rising public interest in alternative medicine, the National Institutes of Health and several drug companies, in 2000, began clinical trials. Government-issued advisories on certain side effects (primarily for HIV patients) and questions about comparative effectiveness were rapidly promulgated by many syndicated health columns and newsletters. A simple search on "St. John's Wort" draws out many herbal descriptions and myriad storefront shops, as well as these cautionary newsletters. A shopper just looking for a good buy might well bypass the advisories, some of which are long and analytic. A reader with an open mind would likely find the advisories useful, although the trials are hardly yet conclusive.

To summarize, suppressing factors include organizational clout, search engine preference for glossy over analytic material, and considerable duplication of top results. Some controversial topics suffer from poor coordination, interlinking, and low status of web sites. Factors that reveal controversies include: explicit promotion, timeliness, media interest, and social relevance.

You mean, the Web isn't all that different from traditional media?

Well, duh, these results are pretty much to be expected and in many ways comparable to traditional media. Websites exist to promote their organizations, "money talks", and search engines return what society is currently interested in. What's different on the Web? First, links are the "currency of the Web", amplifying inter-organizational relationships, and suppressing smaller sites that don't inherit links from high status sites or that link poorly among themselves. Links and hypertext transcend citations and catalogs in libraries and periodicals. Second, search engine ranking reinforces that people are less interested in controversial subtopics: border disputes, academic defensiveness, early 20th century science careers of women, rejected women aviator adventures, and long-running clinical trials. But, on the Web, these controversial topics are only one query away, if the searcher knows the right keywords. Third, search engines are businesses, not a service to society. Search engines seek to provide relevant and widely useful popular pages to searchers who also click ads and paid placements, leaving the more detailed and perhaps less pleasant content to searchers who are willing to work harder. Of course, a search engine needs a positive outlook; it's another form of sales, not a librarian, professional expert, or social activist.

The dilemma is anybody can find the controversies --- if they know the right query terms. But if the top results don't reveal the controversy, it's quite easy to be lulled into the "sunny side" of the topic and miss the more cautionary or interesting "darker side".

Further distinctions exist in the growing literature on search engines and biases of technology (see references below). Observed "power" laws describe a Web where traffic and links for many topics are dominated by a few sites; most sites are neither linked to nor often visited. With search engines basing query results on links, smaller and newer sites are harder to find by crawlers, then rank lower, and consequently receive even less attention except by more narrowly interested parties who find them by specific queries or blogs. Furthermore, even the most potent search engines index relatively small parts of the Web, maybe 25% or less, and different engines index different parts, all missing the "Invisible Web" of dynamic pages and submerged databases. Thus, small, new, or alternative sites face obstacles first becoming retrievable, with no guarantees of high visibility via search engines. Research on the effects of these inherent biases is difficult due to the scale of the Web, unpredictability of proprietary search engine strategies, and mixed expectations of Web quality. Technology, politics, economics, even physics, all help characterize behavior of the Web, helping to explain our experimental results.
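As a rough illustration of why poorly linked sites sink under link-based ranking, here is a minimal sketch of a PageRank-style score computed by power iteration. It is an assumption for illustration only: the engines studied here keep their ranking strategies proprietary, and the page names, the damping value, and the function name link_rank are all hypothetical.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Toy PageRank-style score via power iteration (illustrative only).

    links maps each page name to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline score regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical mini-web: organizational sites link among themselves,
# while one analytic page links out but receives no links at all.
toy_web = {
    "tourism-hub": ["hotel-a", "hotel-b"],
    "hotel-a": ["tourism-hub"],
    "hotel-b": ["tourism-hub"],
    "border-dispute-essay": ["tourism-hub"],
}
scores = link_rank(toy_web)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy web the hub ends up at the top of ranking and the unlinked essay at the bottom, however good its content, which is the pattern the paragraph above describes.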

Can Web behavior be changed? Must controversies remain submerged?

Whether these five topics represent the wider Web or indicate trends is hard to judge without further experimentation. However, from these results, we can hypothesize a more "Objective Web" which exhibits greater diversity and fairness. We've seen that Web behavior is governed by three interlinked communities: web page authors who promote topics and bestow links, search engines that crawl and rank via links, and searchers who reward authors and engines with their attention and clicks. Search engines certainly cannot be expected to become informed, context-aware librarians striving for collection development or professional topic experts offering penetrating and balanced advice. However, they could distinguish better and provide more support for the Analytic Web (details, analysis, longer content) versus the Organizational Web (real world institutions, showing their Web presence). For example, Teoma.com offers an "enthusiasts" section listing heavy linkers and focused topic sites that often tap into more diverse and selected resources. Perhaps the "Semantic Web" mission to bring structure and meaning to web content will offer an alternative to current searching.

In general, search engines reflect the links and pages of website authors. Such authors might adopt a more objective and extensive linking policy, comparable to scientific paper citation, more thoroughly addressing (and linking to) pages with agreeing, opposing, and neutral objective viewpoints. A more aggressive linking strategy among analytic and controversy-expressing sites could exert influence on search engines as well as benefit readers.

Certainly, our five topics suggest additional advice for building good searches. Since engines don't overlap much or cover all the web, it's usually advisable to search multiple engines for a more comprehensive search. Indeed, these experiments showed that search engines overlapped around 30% and were roughly comparable (within 10% of each other) at exposing controversy pages. No, Google wasn't that much better. More engines + more queries = more chance of exposing controversies. The "sunny disposition" of search engines also recommends "looking for trouble" by digging deeper into search results, seeking the most informed link pages, being skeptical of commercialism and advertising, deliberately going beyond the Organizational into the Analytic Web. Just asking "Topic AND Controversy" or some appropriate synonym (dispute, opposition, objection, etc.) may reveal submerged controversial content, but a good controversy search requires knowing the precise keywords. Clustering search engines such as Vivisimo and (the former) Northern Light also may highlight terms associated with controversies.
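The "Topic AND Controversy" advice can be sketched as a tiny query expander. This is a hedged illustration, not the experimental query expander mentioned elsewhere on this page: the function name, the term list, and the assumption that an engine accepts a Boolean AND with quoted phrases are all mine.

```python
# Synonyms taken from the paragraph above; extend as the topic demands.
CONTROVERSY_TERMS = ["controversy", "dispute", "opposition", "objection"]


def controversy_queries(topic, terms=CONTROVERSY_TERMS):
    """Build one 'topic AND term' query per controversy synonym."""
    # Quote multi-word topics so an engine treats them as a phrase
    # (assumes the engine supports phrase quoting).
    quoted = f'"{topic}"' if " " in topic else topic
    return [f"{quoted} AND {term}" for term in terms]


belize_queries = controversy_queries("Belize")
learning_queries = controversy_queries("distance learning")
```

Running each generated query on several engines follows the "more engines + more queries" rule of thumb, though generic synonyms are no substitute for the precise keywords of a given controversy.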

So, do search engines suppress controversy?

Search engines are extraordinarily powerful technology that we are absolutely dependent upon for using the Web. These databases, algorithms, and user interfaces aren't conspiring to suppress controversy; it's just the way they work as good business sense in an information world barely a decade old. Searchers need to learn their biases and how to counteract them with better, more diverse, harder probing queries. Page authors need to link more carefully, extensively, and objectively since each page adds to the web of their topic and its influence on search engines relative to competing topics. Every link counts.

References

Politics of Search Engines

Empirical studies of the Web

Bias Studies

Experimental methodology and Results

An expanded version of this paper appears in First Monday, Jan. 5, 2004. Lists of URLs used in the experiments are available. Also online is an experimental query expander, a Controversy Discovery Engine.

We collected URLs from 5 search engine sources: Google, Teoma, AllTheWeb (FAST), and two multi-searchers, the web-based Profusion (Altavista, About, AOL, Lycos, Raging Search, Wisenut, Metacrawler, MSN, Adobe PDF, Looksmart, Netscape, Teoma, AllTheWeb) and the desktop Copernic (Ah-ha, Altavista, AOL, Euroseek, AlltheWeb, Findwhat, Hotbot, Infospace, Looksmart, Lycos, Mamma, MSN, Netscape, OpenDirectory, Teoma, Wisenut, Yahoo). Results for appropriate simple queries about the broad topic and about the controversy were collected, top 50 for single engines and top 100 for multi-searchers. Searches were performed during August 2003. All URLs were merged together using a Windows URL Analyzer, twURL. Each topic contained 500-900 distinct URLs, which were browsed using twURL views (link counts, domains, keywords), tossing out off-topic and dead URLs. Each was rated as "Deep" (right on target for the controversy, so that a searcher wouldn't miss the story), "Revealing" (with links to or passing mentions of the controversy, but a searcher might well miss the controversial subtopic or its importance), or "Other" (informative, relevant URLs but not about the controversy). Summary data on the overlap of the simple shows that




(1) for the controversy queries (ABOVE), there are many deep and revealing URLs, i.e. controversy pages exist for the right query (average 300 URLs per controversy), but (2) the controversial pages were suppressed or submerged in the simple topic (BELOW):

Chart of Controversy in Simple Searches
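
The collection and rating procedure described in this section can be sketched roughly as follows. This is not how twURL works internally. The URL normalization rule, the definition of overlap (the share of distinct URLs returned by more than one engine), and all engine names and URLs in the example are assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to lowercase host plus path so near-duplicates merge.

    Assumption: scheme, query string, and trailing slash are ignored.
    """
    parts = urlsplit(url.strip())
    return parts.netloc.lower() + parts.path.rstrip("/")


def merge_results(results_by_engine):
    """Map each distinct URL to the set of engines that returned it."""
    merged = {}
    for engine, urls in results_by_engine.items():
        for url in urls:
            merged.setdefault(normalize(url), set()).add(engine)
    return merged


def overlap_share(merged):
    """Fraction of distinct URLs returned by more than one engine."""
    if not merged:
        return 0.0
    shared = sum(1 for engines in merged.values() if len(engines) > 1)
    return shared / len(merged)


def rating_shares(ratings):
    """Fraction of rated URLs in each of the three categories."""
    counts = Counter(ratings.values())
    total = sum(counts.values())
    labels = ("Deep", "Revealing", "Other")
    if total == 0:
        return {label: 0.0 for label in labels}
    return {label: counts[label] / total for label in labels}


# Hypothetical result lists from two engines for one simple query.
results = {
    "engine_a": ["http://example.org/belize/", "http://example.org/hotels"],
    "engine_b": ["http://EXAMPLE.org/belize", "http://example.net/border-dispute"],
}
merged = merge_results(results)

# Ratings are assigned by a human reader; these values are made up.
ratings = {
    "example.org/belize": "Revealing",
    "example.org/hotels": "Other",
    "example.net/border-dispute": "Deep",
}
shares = rating_shares(ratings)
```

The combined share of "Deep" and "Revealing" pages in a simple query's results is one way to express how visible a controversy is, in the spirit of the roughly 15% figure reported for "female astronauts".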


Author Information

Susan Gerhart teaches software engineering, databases, and discrete mathematics at Embry-Riddle Aeronautical University, Prescott AZ (http://pr.erau.edu/~gerharts). She was recently active in developing and evaluating computer security modules for undergraduate curricula (http://nsfsecurity.pr.erau.edu). Dr. Gerhart is also the developer of the twURL web analyzer and has created a portfolio of topics at "the twURLed World".

Updated Jan. 11, 2004. Posted November 15, 2003 at http://www.twurl.com/Controversy