Search engines use software robots to survey the Web and build
their databases. Web documents are retrieved and indexed.
When you enter a query at a search engine website, your
input is checked against the search engine's keyword
indices. The best matches are then returned to you as
hits.
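To make the idea of checking a query against a keyword index concrete, here is a minimal sketch in Python. The function names (build_index, search) and the two sample pages are made up for illustration; real search engines use far more elaborate structures.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits

# Hypothetical sample pages, standing in for retrieved Web documents.
documents = {
    "page1": "heart disease and cholesterol",
    "page2": "valentine heart candy",
}
index = build_index(documents)
print(search(index, "heart disease"))  # {'page1'}
```

The index is built once, ahead of time, so answering a query is a matter of looking up a few words rather than rereading every document.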
There are two primary methods of text searching--keyword and
concept. Keyword searching is far more common. Determining
the concept behind a particular web page continues to
be a challenge for search engine companies.
Keyword Searching
This is the typical form of text search on the Web. Most
search engines do their text query and retrieval using
keywords.
Unless the author of the Web document specifies the keywords
for her document (this is possible by using meta tags),
it's up to the search engine to determine them. Essentially,
this means that search engines pull out and index words
that are believed to be significant. Words that are
mentioned towards the top of a document and words that
are repeated several times throughout the document are
more likely to be deemed important.
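As a rough illustration of how position and repetition might be combined, the sketch below scores each word by how often it appears and gives extra credit to words near the top. The cutoff of 10 words and the bonus of 2.0 are assumed values chosen for the example, not figures from any actual search engine.

```python
from collections import Counter

def significant_words(text, top_n=3, early_cutoff=10, early_bonus=2.0):
    """Score words by repetition, with a bonus for appearing near the top.

    early_cutoff and early_bonus are illustrative assumptions.
    """
    words = text.lower().split()
    scores = Counter()
    for position, word in enumerate(words):
        # A word in the first early_cutoff positions counts for more.
        scores[word] += early_bonus if position < early_cutoff else 1.0
    return [word for word, _ in scores.most_common(top_n)]

print(significant_words(
    "diabetes guide diabetes symptoms and diabetes treatment", top_n=1))
# ['diabetes']
```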
Most search engines now index every word on every page. Others
index only part of the document, such as the title,
headings, subheadings, hyperlinks to other sites, and
the first 20 lines of text.
Full-text indexing systems generally pick up every word in the
text except commonly occurring stop words such as "a,"
"an," "the," "is," "and,"
"or," and "www." AltaVista claims
to index all words, even the articles, "a,"
"an," and "the." Some of the search
engines discriminate upper case from lower case; others
store all words without reference to capitalization.
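The two choices described above (whether to drop stop words, and whether to keep capitalization) can be sketched as options on a single function. The stop-word list is the one quoted in the text; the function name and its parameters are hypothetical.

```python
# Stop words taken from the examples in the text.
STOP_WORDS = {"a", "an", "the", "is", "and", "or", "www"}

def index_terms(text, drop_stop_words=True, case_sensitive=False):
    """Return the words of text that an indexer would store."""
    terms = []
    for raw in text.split():
        word = raw.strip('.,;:!?"()')  # trim surrounding punctuation
        if not word:
            continue
        if drop_stop_words and word.lower() in STOP_WORDS:
            continue
        terms.append(word if case_sensitive else word.lower())
    return terms

print(index_terms("The Oracle is a database"))
# ['oracle', 'database']
print(index_terms("The Oracle is a database",
                  drop_stop_words=False, case_sensitive=True))
# ['The', 'Oracle', 'is', 'a', 'database']
```

The second call mimics an engine that indexes every word and discriminates upper case from lower case.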
The Problem With Keyword Searching
Keyword searches have a tough time distinguishing between words
that are spelled the same way but mean something different
(e.g., hard cider, a hard stone, a hard exam, and the
hard drive on your computer). This often results in
hits that are completely irrelevant to your query. Some
search engines also have trouble with so-called stemming--i.e.,
if you enter the word "big," should they return
a hit on the word "bigger"? What about singular
and plural words? What about verb tenses that differ
from the word you entered by only an "s"
or an "ed"?
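To see why stemming is tricky, consider this deliberately crude suffix stripper. It is a sketch for illustration only: the suffix list and the three-letter minimum are assumptions, and it mangles plenty of ordinary words, which is exactly the difficulty real stemmers have to work around.

```python
SUFFIXES = ("ing", "ed", "er", "es", "s")  # assumed, far from complete

def naive_stem(word):
    """Strip one common suffix so related word forms share a stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip if at least three letters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo a doubled final consonant: "bigg" -> "big".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

print(naive_stem("bigger"))    # big
print(naive_stem("searched"))  # search
print(naive_stem("searches"))  # search
```

If both the query and the indexed words are stemmed this way, a query on "big" can match "bigger," but the same rule also turns "falls" into "fal," which no longer matches "fall."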
Search engines also cannot return hits on keywords that mean
the same, but are not actually entered in your query.
A query on heart disease would not return a document
that used the word "cardiac" instead of "heart."
Concept-Based Searching
Unlike keyword search systems, concept-based search systems
try to determine what you mean, not just what you say.
In the best circumstances, a concept-based search returns
hits on documents that are "about" the subject/theme
you're exploring, even if the words in the document
don't precisely match the words you enter into the query.
Excite is currently the best-known general-purpose search engine
site on the Web that relies on concept-based searching.
Concept-based searching is also known as clustering, which essentially means
that words are examined in relation to other words found
nearby.
How does it work?
There are various methods of building clustering systems,
some of which are highly complex, relying on sophisticated
linguistic and artificial intelligence theory that we
won't even attempt to go into here. Excite sticks to
a numerical approach. Excite's software determines meaning
by calculating the frequency with which certain important
words appear. When several words or phrases that are
tagged to signal a particular concept appear close to
each other in a text, the search engine concludes, by
statistical analysis, that the piece is "about"
a certain subject.
For example, the word heart, when used in the medical/health
context, would be likely to appear with such words as
coronary, artery, lung, stroke, cholesterol, pump, blood,
attack, and arteriosclerosis. If the word heart appears
in a document with other words such as flowers, candy,
love, passion, and valentine, a very different context
is established, and the search engine returns hits on
the subject of romance.
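The heart example can be turned into a toy program. This is not Excite's actual method, only a minimal sketch of the general idea: count how many concept-tagged words turn up near the target word and pick the concept with the most. The word lists come from the example above; the five-word window and all names are assumptions.

```python
# Concept vocabularies taken from the heart example in the text.
CONCEPTS = {
    "medicine": {"coronary", "artery", "lung", "stroke", "cholesterol",
                 "pump", "blood", "attack", "arteriosclerosis"},
    "romance": {"flowers", "candy", "love", "passion", "valentine"},
}

def guess_concept(text, target="heart", window=5):
    """Guess which concept the target word belongs to in this text."""
    words = [w.strip('.,;:!?"()').lower() for w in text.split()]
    scores = {name: 0 for name in CONCEPTS}
    for i, word in enumerate(words):
        if word != target:
            continue
        # Look at the words within `window` positions of the target.
        nearby = words[max(0, i - window): i + window + 1]
        for name, vocabulary in CONCEPTS.items():
            scores[name] += sum(1 for w in nearby if w in vocabulary)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(guess_concept("High cholesterol can clog an artery and strain the heart."))
# medicine
print(guess_concept("Send flowers and candy to win her heart on valentine day"))
# romance
```

When no tagged words appear near the target, the function returns None rather than guessing.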
Warning: This often works better in theory than in practice.
Concept-based indexing is a good idea, but it's far
from perfect. The results are best when you enter a
lot of words, all of which roughly refer to the concept
you're seeking information about.
Refining Your Search
Most sites offer two different types of searches--"basic"
and "advanced." In a "basic" search,
you just enter a keyword without sifting through any
pulldown menus of additional options. Depending on the
engine, though, "basic" searches can be quite
complex.
Advanced search refining options differ from one search engine
to another, but some of the possibilities include the
ability to search on more than one word, to give more
weight to one search term than you give to another,
and to exclude words that might be likely to muddy the
results. You might also be able to search on proper
names, on phrases, and on words that are found within
a certain proximity to other search terms.
Many search engines now automatically recognize company names
and can direct a searcher to a corporate website when
such a name is entered as a query. Phrase recognition
is also becoming more common; e.g., you might expect
to get relevant hits for the term Cold War if you enter
it without the quotation marks that typically denote
a phrase. (In the past, you simply would have received
all documents with the words "cold" and "war"
in them.)
Some search engines also allow you to specify what form you'd
like your results to appear in, and whether you wish
to restrict your search to certain areas of the Internet
(e.g., Usenet or the Web) or to specific parts of Web
documents (e.g., the title or URL).
Many, but not all, search engines allow you to use so-called
Boolean operators to refine your search. These are the
logical terms AND, OR, NOT, and the so-called proximal
locators, NEAR and FOLLOWED BY.
Boolean AND means that all the terms you specify must appear
in the documents, e.g., "heart" AND "attack."
You might use this if you wanted to exclude common hits
that would be irrelevant to your query.
Boolean OR means that at least one of the terms you specify
must appear in the documents, e.g., bronchitis, acute
OR chronic. You might use this if you didn't want to
rule out too much.
Boolean NOT means that the term following NOT
must not appear in the documents. You might use this
if you anticipated results that would be totally off-base,
e.g., nirvana AND Buddhism, NOT Cobain.
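One common way to think about the three operators is as set operations on the documents that contain each term: AND is an intersection, OR is a union, and NOT is a difference. The sketch below assumes a tiny hypothetical collection of five documents and a helper named docs_with.

```python
# Hypothetical mini-collection, keyed by document id.
documents = {
    1: "nirvana is a central idea in buddhism",
    2: "kurt cobain sang for nirvana",
    3: "acute bronchitis symptoms",
    4: "chronic bronchitis treatment",
    5: "a history of buddhism",
}

def docs_with(term):
    """Return the ids of the documents that contain the term as a whole word."""
    term = term.lower()
    return {doc_id for doc_id, text in documents.items() if term in text.split()}

# AND: every term must appear, so intersect the sets.
print(docs_with("nirvana") & docs_with("buddhism"))   # {1}
# OR: at least one term must appear, so take the union.
print(docs_with("acute") | docs_with("chronic"))      # {3, 4}
# NOT: drop any document that contains the excluded term.
print(docs_with("nirvana") - docs_with("cobain"))     # {1}
```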
Not quite Boolean + and -: Some search engines use the characters
+ and - instead of Boolean operators to include and
exclude terms.
NEAR means that the terms you enter should be within a certain
number of words of each other. FOLLOWED BY means that
one term must directly follow the other. ADJ, for adjacent,
serves the same function. A search engine that will
allow you to search on phrases uses, essentially, the
same method (i.e., determining adjacency of keywords).
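Proximity operators depend on word positions rather than mere presence. The following is a minimal sketch, with hypothetical function names and an assumed default distance of 10 words for NEAR (the actual number varies from one search engine to another).

```python
def near(text, first, second, distance=10):
    """True if the two terms occur within `distance` words of each other."""
    words = text.lower().split()
    first_at = [i for i, w in enumerate(words) if w == first.lower()]
    second_at = [i for i, w in enumerate(words) if w == second.lower()]
    return any(i != j and abs(i - j) <= distance
               for i in first_at for j in second_at)

def followed_by(text, first, second):
    """True if `second` appears immediately after `first`."""
    words = text.lower().split()
    return any(a == first.lower() and b == second.lower()
               for a, b in zip(words, words[1:]))

def contains_phrase(text, phrase):
    """True if the words of the phrase appear adjacent and in order."""
    words = text.lower().split()
    target = phrase.lower().split()
    if not target:
        return False
    return any(words[i:i + len(target)] == target
               for i in range(len(words) - len(target) + 1))

text = "the cold war ended but the war on cold weather began"
print(followed_by(text, "cold", "war"))          # True
print(followed_by(text, "war", "cold"))          # False
print(near(text, "war", "weather", distance=3))  # True
print(contains_phrase(text, "cold war ended"))   # True
```

Phrase matching here is simply FOLLOWED BY applied to every adjacent pair of words in the phrase.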
Phrases: The ability to query on phrases is very important in
a search engine. Those that allow it usually require
that you enclose the phrase in quotation marks, e.g.,
"space the final frontier."
Capitalization: This is essential for searching on proper names of people,
companies or products. Unfortunately, many words in
English are used both as proper and common nouns--Bill,
bill, Gates, gates, Oracle, oracle, Lotus, lotus, Digital,
digital--the list is endless.
All the search engines have different methods of refining
queries. The best way to learn them is to read the help
files on the search engine sites and practice!
Relevancy Rankings
Most of the search engines return results with confidence
or relevancy rankings. In other words, they list the
hits according to how closely they think the results
match the query. However, these lists often leave users
shaking their heads in confusion, since, to the user,
the results might not seem relevant.
Why does this happen? Basically it's because search engine
technology has not yet reached the point where humans
and computers understand each other well enough to communicate
clearly.
Most search engines use search term frequency as a primary
way of determining whether a document is relevant. If
you're researching diabetes and the word "diabetes"
appears multiple times in a Web document, it's reasonable
to assume that the document will contain useful information.
Therefore, a document that repeats the word "diabetes"
over and over is likely to turn up near the top of your
list.
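Ranking by term frequency alone takes only a few lines, which is part of its appeal. The sketch below uses a hypothetical function name and three made-up documents.

```python
def rank_by_frequency(documents, keyword):
    """Return (doc_id, count) pairs, most occurrences of keyword first."""
    keyword = keyword.lower()
    counts = {doc_id: text.lower().split().count(keyword)
              for doc_id, text in documents.items()}
    hits = [(doc_id, n) for doc_id, n in counts.items() if n > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Made-up documents for illustration.
documents = {
    "a": "diabetes diabetes diabetes diabetes",
    "b": "managing diabetes with diet and exercise",
    "c": "heart disease overview",
}
print(rank_by_frequency(documents, "diabetes"))
# [('a', 4), ('b', 1)]
```

Document "a" wins simply by repeating the word, even though "b" is the one that says something about it.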
If your keyword is a common one, or if it has multiple
other meanings, you could end up with a lot of irrelevant
hits. And if your keyword is a subject about which you
desire information, you don't need to see it repeated
over and over--it's the information about that word
that you're interested in, not the word itself.
Some search engines consider both the frequency and the positioning
of keywords to determine relevancy, reasoning that if
the keywords appear early in the document, or in the
headers, this increases the likelihood that the document
is on target. For example, Lycos ranks hits according
to how many times your keywords appear in their indices
of the document and in which fields they appear (i.e.,
in headers, titles or text). It also takes into consideration
whether the documents that emerge as hits are frequently
linked to by other documents on the Web, reasoning that
if other folks consider them important, you should,
too.
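A scoring scheme along these lines might weight each field differently and add a bonus for incoming links. To be clear, this is not the formula Lycos or anyone else uses: the field weights, the link weight, and the document layout below are all assumptions made up for the sketch.

```python
# Assumed weights: a match in the title counts most, body text least.
FIELD_WEIGHTS = {"title": 3.0, "headers": 2.0, "text": 1.0}

def score(doc, keyword, link_weight=0.5):
    """Combine keyword frequency, field, and inbound links into one score."""
    keyword = keyword.lower()
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        total += weight * doc.get(field, "").lower().split().count(keyword)
    if total == 0:
        return 0.0  # popularity alone does not make a document a hit
    return total + link_weight * doc.get("inbound_links", 0)

well_linked = {"title": "Diabetes basics", "headers": "",
               "text": "diabetes care", "inbound_links": 4}
repetitive = {"title": "Recipes",
              "text": "diabetes diabetes diabetes", "inbound_links": 0}
print(score(well_linked, "diabetes"))  # 6.0
print(score(repetitive, "diabetes"))   # 3.0
```

Here a single mention in the title plus a few incoming links outranks three repetitions in the body text.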
If you use the advanced query form on AltaVista, you can
assign relevance weights to your query terms before
conducting a search. Although this takes some practice,
it essentially allows you to have a stronger say in
what results you will get back.
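The effect of weighting query terms can be sketched as follows. This does not reproduce AltaVista's query form or syntax; the function name and the weights are hypothetical, and the point is only that a heavier term pulls matching documents up the list.

```python
def weighted_score(text, term_weights):
    """Sum each term's occurrence count multiplied by its weight."""
    words = text.lower().split()
    return sum(weight * words.count(term.lower())
               for term, weight in term_weights.items())

# Assumed weights: "attack" matters three times as much as "heart".
weights = {"heart": 1.0, "attack": 3.0}
print(weighted_score("heart attack heart", weights))  # 5.0
print(weighted_score("heart heart heart", weights))   # 3.0
```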
As far as the user is concerned, relevancy ranking is critical,
and becomes more so as the sheer volume of information
on the Web grows. Most of us don't have the time to
sift through scores of hits to determine which hyperlinks
we should actually explore. The more clearly relevant
the results are, the more we're likely to value the
search engine.