The behavior of the Web depends upon interlocking communities and their objectives: (1) authors whose web pages link to other pages; (2) search engines indexing and ranking those pages; and (3) information seekers whose queries and surfing reward authors and support search engines. Although technological and personal bias is inevitable, systematic suppression of controversial topics would indicate a flaw in the Web's ideology of openness and informativeness. This paper's experiments explore search engines' bias by asking: is a specific well-known controversy revealed in a simple search? Experimental topics include: distance learning, Albert Einstein, St. John's Wort, female astronauts, and Belize. The experiments suggest that simple queries tend to overly present the "sunny side" of these topics, with minimal controversy. Alternative behaviors for a more "Objective Web" are analyzed: (a) web page authors adopting research citation practices, (b) search engines balancing organizational and analytic content, and (c) searchers practicing more wary multi-searching.
Do Search Engines Have Multiple Personalities? An Experiment.
Summary:
Search engine indexing and ranking mechanisms favor "sunny personalities", where a naive query leads to the best news about a subject and more astute questioning is required to reveal controversies and the darker side of the queried subject. By analogy, search engines might instead respond with personalities more like a human subject-area expert, who provides more sides to a subject as well as the means to evaluate the query responses. An experiment shows that 3 search engines (Google, Teoma, and AlltheWeb) exhibit dominantly "sunny" personalities on the subject of distance learning, but may be prompted by asking for "distance learning" AND controversy to reveal "well-balanced" resources as well as the "darker" personality of an attacker's phrase, "digital diploma mills".
Problem:
Suppose you asked a so-called expert about a hot topic like "distance learning". Maybe you're thinking of picking up an extra degree, switching careers, trying to find a cheaper quality education path for your children, or expanding your institutional business. You'd expect that expert to tell you quite a bit of what he or she knows (recall), to stay on topic (relevance), and to give some priority to facts and references. You might also expect the expert to have some concept map of the topic, evaluation criteria, and a balanced perspective that discloses the dark side and the critical issues as well as the positives and glossy aspects of the subject.
Following this analogy, do search engines (experts) have "personalities" that lead them to ignore or cover up the dark side of a subject? Do they make you, the searcher, ask just the right question in just the right way to force the search engine to disclose controversies and more balanced views of the topic? Our answer is: Yes, and we'll provide some empirical data to back up our claims on the subject "Distance Learning" (loosely defined as some form of different time/different place teaching/learning situation), abbreviated DL.
Background:
Technically astute search engine users know how to tweak queries to achieve better recall and relevance. However, a surprising number of searchers are less attuned to an underlying fact of the web: each engine has indexed a large but different part of the web, say 2 or 3 billion pages, and there remains an invisible web, obligating them to use multiple engines for a thorough search. Also, information pros know that the responsibility is, properly, on the query builder to ask the right question, and to ask it in the right way, to elicit a well-balanced answer from whatever engine is being used. But what about the hapless user who asks the simplest or most naive query: do they, in return, deserve only the glossy stuff and none of the negatives? Or what happens to the sophisticated searcher who doesn't ask exactly the right question or makes any number of common mistakes in asking it the right way? Is it a fact on the same level as "no engine covers the web" that "engines will deliver the most positive side of a topic unless probed more skeptically"?
Empirical Study:
Our experiment consists of asking 3 different engines (Google, Teoma, and AlltheWeb) for the top 50 results on variations of the query:
- "distance learning", the phrase for the naive query, returning lots of glossy pages: DL associations and universities practicing DL
- "distance learning" AND controversy, to elicit web pages that expose the existence of controversy, notably disclosing a 1999 attack by a technology historian on the effectiveness of DL and its deleterious effects on the infrastructure of universities and the academic enterprise
- "distance learning" AND "digital diploma mill" (the pejorative term) and "distance learning" AND "David Noble" (the historian), to dig deep into the darker side of DL
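The query variations above can be written down as a small search plan. This is only a sketch: the function and label names are hypothetical, the query strings are copied from the list, and nothing here reflects how twURL or any engine actually accepts queries.

```python
# Hypothetical sketch of the query plan; the names are illustrative only.
BASE_PHRASE = '"distance learning"'
ENGINES = ["Google", "Teoma", "AlltheWeb"]
TOP_N = 50  # results requested per engine and query


def build_queries(base=BASE_PHRASE):
    """Return the four query variants from the list, keyed by a short label."""
    return {
        "naive": base,
        "skeptical": f"{base} AND controversy",
        "pejorative": f'{base} AND "digital diploma mill"',
        "author": f'{base} AND "David Noble"',
    }


def search_plan(engines=ENGINES, top_n=TOP_N):
    """List every (engine, label, query, top_n) combination to run."""
    queries = build_queries()
    return [
        (engine, label, query, top_n)
        for engine in engines
        for label, query in queries.items()
    ]


if __name__ == "__main__":
    for engine, label, query, top_n in search_plan():
        print(f"{engine:10s} {label:11s} top {top_n}: {query}")
```

Keeping the base phrase in one place makes it easy to rerun the same plan on another topic, such as the others named in the abstract.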
We used a URL Calculator and Browser, twURL, to perform our experiment, loading in the search results and then rating each page as (1) not revealing of controversy, (2) deep into the heart of the controversy, or (3) fairly balanced, admitting issues and acknowledging controversy. Our experimental questions were:
- Would the naive query reveal the darker side of DL?
- How hard would we have to work to expose any controversy and reach a well-balanced answer?
Results were:
- Naive Query: Since twURL simply scrapes search result pages, we gave Teoma some extra URLs for its resource collections and, lo and behold, these were clear winners as leads toward better balanced results, including controversies. AlltheWeb and Google revealed a useful literature review. All three engines surfaced an online journal, and Google also exposed a specific controversy, copyright. In all, Teoma disclosed 9, AlltheWeb 2, and Google 4 URLs that were within one link of the dark side, with only one search result having exact references and links on its pages. That's a total of 12 of 160 URLs in the vicinity of controversy, which leads to the conclusion that the naive query search results exhibit a "dominantly sunny personality". The remaining URLs were active DL sites, associations for DL, and clearinghouses. A naive searcher would be unlikely to stumble out of the spotlight on success into the questioning, if not dark, side of DL. A sophisticated user and close reader might well rely on Teoma's resource collections but otherwise would likely miss the dark side.
- Skeptical Queries: It's not hard, however, to force any of the search engines to "disclose" controversy by asking for, well, "distance learning" AND controversy. Luckily, the primary 1999 controversy is so controversial that web page authors use the word "controversy" rather than any of its synonyms or related terms. Of 147 URLs (some were 404s), 7 were direct on the controversy, 53 appeared somewhat balanced, 71 were responsive to DL but not revealing of controversy, and 12 were not responsive. Teoma again did better than Google, and both were better for relevance than AlltheWeb. A searcher scanning the search result pages would likely see a snippet leading to a click off to one of the balanced pages or into the heart of the controversy. Words like "digital diploma mill" jump out of the search results, and the author's name "David Noble" (not, whoops, Nobel) leads into the deeper expositions and counterarguments to the controversy. Queries directly about this controversy returned lots of relevant pages, with many duplicates of the same article, assuring that the searcher wouldn't miss the controversy.
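The reported counts are easier to compare as percentages. The following is a minimal sketch using only the figures stated above; the category labels and function names are shortened, hypothetical stand-ins. Note that the four skeptical-query categories sum to 143 while the text reports 147 URLs, so the sketch uses the categorized total and makes no guess about the remainder.

```python
# Counts copied from the results above; labels are illustrative shorthand.
SKEPTICAL_COUNTS = {
    "direct_on_controversy": 7,
    "somewhat_balanced": 53,
    "responsive_no_controversy": 71,
    "not_responsive": 12,
}
NAIVE_NEAR_CONTROVERSY = 12  # URLs "in the vicinity of controversy"
NAIVE_TOTAL = 160            # URLs examined for the naive query


def share(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


def shares(counts):
    """Percentage share of each category relative to the categorized total."""
    total = sum(counts.values())
    return {label: share(n, total) for label, n in counts.items()}


if __name__ == "__main__":
    print("naive query, near controversy:",
          share(NAIVE_NEAR_CONTROVERSY, NAIVE_TOTAL), "%")
    for label, pct in shares(SKEPTICAL_COUNTS).items():
        print(f"skeptical query, {label}: {pct} %")
```

Under these figures, the naive query put 7.5% of its URLs near the controversy, while direct and somewhat balanced pages together made up roughly two fifths of the categorized skeptical-query results.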
Conclusions:
So, what have we learned here? First, this is just stating the obvious: search engine prioritization, such as Google's PageRank, together with the active practice of search engine promotion, leads to a web of glossy pages that dominate search results. The query "distance learning" is too naive to expose the wider, more responsible search that draws in balanced and even controversial pages. It takes a more skeptical query to elicit a balanced presentation of the subject, which worked well on this DL topic.
A demanding search user might ask: is that really too much to ask? Why shouldn't search engine results be better balanced, say 50% glossy, 30% balanced, and 20% negative? Yes, the professional and more concerned searcher can sprinkle in "controversy" and other negatives, but is this an underlying personality flaw of search engines: balmy, hype-susceptible, and non-disclosing?
Experimental Data:
Author: Susan L Gerhart, May 31, 2003. Published on Research Outlet and Integration (home of twURL).