Updated as
"Do Search Engines Suppress Controversy?"   

Shorter Version (RECOMMENDED)


The behavior of the Web depends upon interlocking communities and their objectives: (1) authors whose web pages link to other pages; (2) search engines that index and rank those pages; and (3) information seekers whose queries and surfing reward authors and support search engines. Although technological and personal bias is inevitable, systematic suppression of controversial topics would indicate a flaw in the Web's ideology of openness and informativeness. This paper's experiments explore search engines' bias by asking: is a specific, well-known controversy revealed in a simple search? Experimental topics include distance learning, Albert Einstein, St. John's Wort, female astronauts, and Belize. The experiments suggest that simple queries tend to over-present the "sunny side" of these topics, with minimal controversy. Alternative behaviors for a more "Objective Web" are analyzed: (a) web page authors adopting research citation practices, (b) search engines balancing organizational and analytic content, and (c) searchers practicing more wary multi-searching.




Do Search Engines have Multiple Personalities? An Experiment.

Summary: Search engine indexing and ranking mechanisms favor "sunny personalities": a naive query leads to the best news about a subject, and more astute questioning is required to reveal controversies and the darker side of the queried subject. By analogy, search engines might instead respond with personalities more like a human subject-area expert, who provides more sides of a subject as well as the means to evaluate the query responses. An experiment shows that three search engines (Google, Teoma, and AlltheWeb) exhibit dominantly "sunny" personalities on the subject of distance learning, but may be prompted by asking for "distance learning" AND controversy to reveal "well-balanced" resources as well as the "darker" personality captured in an attacker's phrase, "digital diploma mills".

Problem: Suppose you asked a so-called expert about a hot topic like "distance learning". Maybe you're thinking of picking up an extra degree, switching careers, trying to find a cheaper quality education path for your children, or expanding your institutional business. You'd expect that expert to tell you quite a bit of what he/she knows (recall), to stay on topic (relevance), and to give some priority to facts and references. You might also expect the expert to have some concept map of the topic, evaluation criteria, and a balanced perspective that discloses the dark side and the critical issues as well as the positives and glossy aspects of the subject.

Following this analogy, do search engines (experts) have "personalities" that lead them to ignore or cover up the dark side of a subject? Do they make you, the searcher, ask just the right question in just the right way to force the search engine to disclose controversies and more balanced views of the topic? Our answer is: Yes, and we'll provide some empirical data to back up our claims on the subject "Distance Learning" (loosely defined as some form of different time/different place teaching/learning situation), abbreviated DL.

Background: Technically astute search engine users know how to tweak queries to achieve better recall and relevance. However, a surprising number of searchers are less attuned to an underlying fact of the web: each engine has indexed a large but different part of it, say 2 or 3 billion pages, and an invisible web remains beyond that, obligating them to use multiple engines for a thorough search. Also, information pros know that the responsibility is, properly, on the query builder to ask the right question, and to ask it in the right way, to elicit a well-balanced answer from whatever engine is being used. But what about the hapless naive user who asks the simplest query: do they, in return, deserve only the glossy stuff and none of the negatives? Or what happens to the sophisticated searcher who doesn't ask exactly the right question, or who makes any of a number of common mistakes in phrasing it? Is it a fact on the same level as "no engine covers the web" that "engines will deliver the most positive side of a topic unless probed more skeptically"?

Empirical Study: Our experiment consists of asking 3 different engines - Google, Teoma, and AlltheWeb - for the top 50 results on variations of the query:

We used a URL Calculator and Browser, twURL, to perform our experiment, loading in the search results and then rating each page as (1) not revealing of controversy, (2) deep into the heart of the controversy, or (3) fairly balanced, admitting issues and acknowledging controversy. Our experimental questions were:
  1. Would the naive query reveal the darker side of DL?
  2. How hard would we have to work to expose any controversy and reach a well balanced answer?
Results were:
  1. Naive Query: Since twURL simply scrapes search result pages, we gave Teoma some extra URLs for its resource collections and, lo and behold, these were clear winners as leads toward better-balanced results, including controversies. AlltheWeb and Google revealed a useful literature review. All three engines surfaced an Online Journal, and Google also exposed a specific controversy, copyright. In all, Teoma disclosed 9, AlltheWeb 2, and Google 4 URLs that were within one link of the dark side, with only one search result having exact references and links on its pages. That's a total of 12 of 160 URLs in the vicinity of controversy, which leads to the conclusion that the naive query search results exhibit a "dominantly sunny personality". The remaining URLs were active DL sites, associations for DL, and clearinghouses. A naive searcher would be unlikely to stumble out of the spotlight on success into the questioning, if not dark, side of DL. A sophisticated user and close reader might well rely on Teoma's resource collections but otherwise would likely miss the dark side.
  2. Skeptical Queries: It's not hard, however, to force any of the search engines to "disclose" controversy by asking for, well, "distance learning" AND controversy. Luckily, the primary 1999 controversy is so controversial that web page authors use the word "controversy" itself rather than any of its synonyms or related terms. Of 147 URLs (some were 404s), 7 were directly on the controversy, 53 appeared somewhat balanced, 71 were responsive to DL but not revealing of controversy, and 12 were not responsive. Teoma again did better than Google, and both were better for relevance than AlltheWeb. A searcher scanning the search result pages would likely see a snippet leading to a click off to one of the balanced pages or into the heart of the controversy. Words like "digital diploma mill" jump out of the search results, and the author's name "David Noble" (not, whoops, Nobel) leads into the deeper expositions of, and counterarguments to, the controversy. Queries directly about this controversy returned lots of relevant pages, with many duplicates of the same article, assuring that the searcher wouldn't miss the controversy.
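
To make the two result sets easier to compare, the reported counts can be turned into percentages. The following is only a small sketch of that arithmetic in Python: the counts come from the results above, while the variable names and category labels are our own illustrative choices and are not part of twURL.

```python
# Counts copied from the results text above. The labels are illustrative
# names chosen for this sketch; they are not terms used by twURL.
NAIVE_NEAR_CONTROVERSY = 12
NAIVE_TOTAL = 160

SKEPTICAL_TOTAL = 147
skeptical_counts = {
    "direct_on_controversy": 7,
    "somewhat_balanced": 53,
    "responsive_no_controversy": 71,
    "not_responsive": 12,
}


def percent(count, total):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


naive_share = percent(NAIVE_NEAR_CONTROVERSY, NAIVE_TOTAL)
skeptical_shares = {
    label: percent(n, SKEPTICAL_TOTAL) for label, n in skeptical_counts.items()
}

# URLs that fall into none of the four categories
# (the text notes that some of the 147 were 404s).
unrated = SKEPTICAL_TOTAL - sum(skeptical_counts.values())

# Pages that either sit directly on the controversy or present it in balance.
revealing_share = percent(
    skeptical_counts["direct_on_controversy"]
    + skeptical_counts["somewhat_balanced"],
    SKEPTICAL_TOTAL,
)

print("naive query, near controversy:", naive_share, "%")
print("skeptical query, by category:", skeptical_shares)
print("skeptical query, revealing controversy:", revealing_share, "%")
print("skeptical query, unrated URLs:", unrated)
```

On these figures, 7.5% of the naive query's URLs were in the vicinity of controversy, while about 41% of the skeptical query's URLs were either directly on the controversy or somewhat balanced.
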
Conclusions: So, what have we learned here? First, this is just stating the obvious: search engine prioritization, such as Google's PageRank, together with the active practice of search engine promotion, leads to a web of glossy pages that dominate search results. The query "distance learning" is too naive to open up the wider, more responsible search that draws in balanced and even controversial pages. It takes a more skeptical query to elicit a balanced presentation of the subject, and that approach worked well on this DL topic.

A demanding search user might ask: is that really too much to ask? Why shouldn't search engine results be better balanced, say 50% glossy, 30% balanced, and 20% negative? Yes, the professional and more concerned searcher can sprinkle in "controversy" and other negatives, but is this an underlying personality flaw of search engines: balmy, hype-susceptible, and non-disclosing?
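
As a rough illustration of how far even the skeptical query sits from that 50/30/20 wish, the sketch below compares the two in Python. It rests on a mapping that is our assumption, not part of the original rating scheme: "responsive but not revealing" is treated as glossy, "somewhat balanced" as balanced, and "direct on the controversy" as negative, with non-responsive and unrated URLs left out.

```python
# Skeptical-query counts from the results section, mapped onto the
# glossy/balanced/negative mix. The mapping itself is an assumption
# made for this sketch.
observed_counts = {
    "glossy": 71,    # responsive to DL but not revealing of controversy
    "balanced": 53,  # somewhat balanced
    "negative": 7,   # direct on the controversy
}

# The hypothetical mix a demanding searcher might ask for, in percent.
target_mix = {"glossy": 50.0, "balanced": 30.0, "negative": 20.0}

rated_total = sum(observed_counts.values())

# Observed share of each category among the rated, responsive URLs.
observed_mix = {
    label: 100 * count / rated_total for label, count in observed_counts.items()
}

# Positive gap: more of this category than the target; negative gap: less.
gaps = {
    label: round(observed_mix[label] - target_mix[label], 1)
    for label in target_mix
}

for label in target_mix:
    print(
        f"{label}: observed {observed_mix[label]:.1f}%, "
        f"target {target_mix[label]:.0f}%, gap {gaps[label]:+.1f}"
    )
```

Under this mapping the skeptical query comes out at roughly 54% glossy, 40% balanced, and 5% negative. The naive query cannot be broken down the same way, since only its total of 12 URLs near controversy is reported.
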

Experimental Data:



Author: Susan L Gerhart, May 31, 2003. Published on Research Outlet and Integration (home of twURL).