Information Retrieval (for beginners)

Information Retrieval

James Melzer

June 15, 2006

How Does Search Work?

The basics of search

• A search engine mediates between the user’s query and the metadata surrogates
that stand in for documents

• Documents are reduced to metadata

• User’s need is translated into a query

• Query terms are used to find matching metadata terms

• Lots and lots of room for error...

The search process

1. Crawl content for metadata

2. Index document terms into an inverted file;
an inverted file is very fast to search

3. Search the index to identify the result set;
search the index - not the documents

4. Rank the results for display;
ranking is the hardest part
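
Steps 2 and 3 above can be sketched in a few lines. This is a minimal illustration, not how any particular engine works: the tokenizer is deliberately naive, the sample documents are made up, and the function names (tokenize, build_inverted_index, search) are hypothetical. The point is that a query only touches the index, never the documents themselves.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Made-up sample documents, for illustration only.
docs = {
    1: "search engines index documents",
    2: "ranking is the hardest part of search",
    3: "documents are reduced to metadata",
}
index = build_inverted_index(docs)
print(search(index, "documents search"))
```

The result set is unordered here; ranking (step 4) is a separate problem.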

Search algorithm 1

Term-based Ranking (tf/idf)

• tf = term frequency
documents that use the query terms most are presumed to be most relevant

• idf = inverse document frequency
terms that are more rare are better indicators of relevance

• Assumptions
1) relevance can be measured with document terms
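
One common way to combine the two ideas is to multiply them: score = tf × idf, summed over the query terms. The sketch below assumes tf is the term count divided by document length and idf is log(N / df), where N is the number of documents and df is the number of documents containing the term. Many other weightings exist; the function name tf_idf_scores and the sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(docs, query):
    """Score each document against the query with a basic tf * idf sum."""
    n_docs = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        if not tokens:
            scores[doc_id] = 0.0
            continue
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if df[term] == 0:
                continue  # term appears nowhere, so it cannot discriminate
            tf = counts[term] / len(tokens)     # assumed tf definition
            idf = math.log(n_docs / df[term])   # assumed idf definition
            score += tf * idf
        scores[doc_id] = score
    return scores


# Made-up sample documents, for illustration only.
docs = {"a": "cat cat dog", "b": "dog bird", "c": "bird fish"}
print(tf_idf_scores(docs, "cat"))
```

With this weighting, a term that appears in every document gets an idf of zero and contributes nothing to the ranking.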

Search algorithm 2

PageRank (Google)

• Relevant set is still identified by term matching

• A revolution in ranking:
based on linking between documents

• Assumptions:
1) important sites link to other important sites
2) if many people link to a site, it is important
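
The two assumptions can be illustrated with the textbook power-iteration form of PageRank. This is a sketch of the published idea, not Google's production system. The damping factor of 0.85, the fixed iteration count, the handling of pages with no outbound links, and the tiny link graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: a, b and d all link to c; c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this graph c ranks highest because many pages link to it, and a outranks b and d because its one inbound link comes from an important page.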

Citation Analysis

• Authors carefully select articles to cite

• The more citations an article gets,
the better it must be

• Citations from authors who are themselves highly cited confer some of that
authority on the articles they cite

• Aggregate and leverage all these small individual decisions...

How Complex is Google?

Google has about 36 ranking algorithms

Examples:

Citation Analysis

Statistical Clustering

Parsing Document Structure

Parsing Data in the Document

Microcontent Parsing

How to Make Search Better?

Evaluating Search

Recall

the percentage of all relevant documents retrieved

100% recall means every relevant document is retrieved

Precision

the percentage of documents retrieved that are relevant

100% precision means only relevant documents are retrieved
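
Both measures are simple ratios once the set of relevant documents is known. The helper below is a minimal sketch; the name precision_recall and the sample document ids are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids the search engine returned
    relevant:  ids a judge considers relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, three relevant overall, two in common.
precision, recall = precision_recall([1, 2, 3, 4], [2, 4, 6])
print(precision, recall)
```

Note that the calculation needs the full set of relevant documents, which is rarely available for a large collection.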

Thoughts & Reservations about Evaluating Search

• Precision and Recall usually trade off against each other, so improving one
often reduces the other.

• Given a corpus of content like the web (tens of billions of items)...
Recall is unmeasurable, and thus essentially meaningless

• What is relevance?

• Measuring Precision depends on an agreed definition of relevance, which is
tricky (human cataloging is only about 80% ‘accurate’; relevance is very hard
to quantify)

Best Bets

(Figure: Zipf curve)

• Manually selected results, tied to specific query terms or phrases

• User-driven phrases
select the most-used phrases from search traffic;
go for easy wins, because returns diminish sharply

• Business-driven phrases
select phrases important to the business;
such as product names or office locations;
or politically sensitive phrases, so you can control the message people see
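
A sketch of the user-driven approach: count normalized queries in the search log, take the top few, and check how much traffic they cover before hand-picking results for them. The log entries, the URLs in the best-bets table, and the function names are all hypothetical.

```python
from collections import Counter


def normalize(query):
    """Lowercase and collapse whitespace so variants count together."""
    return " ".join(query.lower().split())


def top_queries(query_log, k):
    """Return the k most frequent normalized queries with their counts."""
    return Counter(normalize(q) for q in query_log).most_common(k)


def coverage(query_log, k):
    """Fraction of all searches covered by the k most frequent queries."""
    if not query_log:
        return 0.0
    return sum(count for _, count in top_queries(query_log, k)) / len(query_log)


def lookup_best_bet(query, best_bets):
    """Return the hand-picked result for a query, or None if there is none."""
    return best_bets.get(normalize(query))


# Hypothetical search log and best-bets table.
log = ["benefits", "Benefits", "holiday schedule", "benefits",
       "parking", "holiday schedule", "vpn"]
best_bets = {"benefits": "/hr/benefits", "holiday schedule": "/hr/holidays"}

print(top_queries(log, 2))
print(coverage(log, 2))
print(lookup_best_bet("  Benefits ", best_bets))
```

Comparing coverage at increasing values of k is one way to see where the easy wins stop.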

Relevance Feedback

• The user provides direct or indirect feedback on the search results

• Click tracking

• “More like this” or “Find similar”

• Clustering

Structured Search

• Designers use patterns in search behavior to guess the user’s intent;
this requires a substantial understanding of user behavior;
it may require structured content (although not necessarily)

Examples

• Zip Code -> Zip Code Lookup Tool

• Person’s name -> Directory Listing

• Product Name -> Shop or Support?

• Address -> Map this?

• Topic -> Introduction, Forms, Policies or Reports?
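
A few of these patterns can be recognized with simple rules. The sketch below assumes US-style zip codes and street addresses, and it takes the product and directory name lists as inputs. The regular expressions, intent labels, and function name are illustrative guesses, not a tested classifier.

```python
import re

# Assumption: US zip codes, either 5 digits or ZIP+4.
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")

# Assumption: an address starts with a number and ends with a street suffix.
ADDRESS_RE = re.compile(
    r"^\d+\s+.+\b(?:st|street|ave|avenue|rd|road|blvd|dr|drive)\b\.?$",
    re.IGNORECASE,
)


def guess_intent(query, product_names=frozenset(), directory_names=frozenset()):
    """Guess which structured result, if any, fits the query."""
    q = " ".join(query.split())
    if ZIP_RE.match(q):
        return "zip_code_lookup"
    if ADDRESS_RE.match(q):
        return "map"
    if q.lower() in product_names:
        return "product"      # still ambiguous: shop or support?
    if q.lower() in directory_names:
        return "directory"
    return "general_search"


print(guess_intent("22030"))
print(guess_intent("123 Main St"))
print(guess_intent("jane doe", directory_names={"jane doe"}))
```

Rules like these only guess at intent, so the structured result is usually best shown alongside the ordinary results rather than in place of them.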

Controlled Vocabularies

• Classification with a controlled vocabulary is the best way to ensure 100%
Recall

• Lead-in synonyms
enter “fridge”; get “refrigerator” instead;
best if the collection is well-cataloged;
increases precision (e.g. in a library)

• Term-expansion synonyms
enter “refrigerator”; get “fridge” too;
best if the collection is not well-cataloged;
increases recall at the cost of precision (e.g. on eBay)

• Spell check on query phrases
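
The difference between the two synonym strategies, plus a basic spell check, can be sketched as three small query-rewriting helpers. The synonym tables use the fridge/refrigerator pair from above. The function names, the sample vocabulary, and the similarity cutoff of 0.8 are assumptions for illustration. The spell check relies on get_close_matches from Python's standard difflib module.

```python
import difflib

# Lead-in: map a non-preferred term to the single preferred term.
LEAD_IN = {"fridge": "refrigerator"}

# Term expansion: add known variants alongside the entered term.
EXPANSION = {"refrigerator": {"fridge"}}


def apply_lead_in(terms, lead_in):
    """Replace each term with its preferred form, if it has one."""
    return [lead_in.get(term, term) for term in terms]


def expand_terms(terms, expansion):
    """Keep each term and append its synonyms (sorted for stable output)."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(sorted(expansion.get(term, ())))
    return expanded


def suggest_spelling(term, vocabulary, cutoff=0.8):
    """Return the closest vocabulary term, or None if nothing is close."""
    matches = difflib.get_close_matches(term, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(apply_lead_in(["fridge", "magnet"], LEAD_IN))
print(expand_terms(["refrigerator"], EXPANSION))
print(suggest_spelling("refridgerator", ["refrigerator", "freezer"]))
```

Lead-in narrows the query to one cataloged term, while expansion widens it to catch uncataloged variants.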

Why is search important?

IF:
About half of all users prefer to
search first*

THEN:
What percentage of a content
site’s development effort should
be devoted to search?

* This statistic is highly context-dependent. People’s
behavior depends on the context of their actions.
The stat is from Jared Spool.

Questions?
James Melzer
Information Architect
SRA International
james_melzer@sra.com
