3. The basics of search
• A search engine mediates between a user’s query and metadata surrogates for
documents
• Documents are reduced to metadata
• The user’s need is translated into a query
• Query terms are used to find matching metadata terms
• Lots and lots of room for error...
4. The search process
1. Crawl content for metadata
2. Index document terms into an inverted file;
an inverted file is very fast to search
3. Search the index to identify the result set;
search the index - not the documents
4. Rank the results for display;
ranking is the hardest part
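As a rough illustration of steps 2 and 3, here is a minimal sketch of an inverted file. The function names, the whitespace tokenizer and the sample documents are assumptions made for this example, not part of any particular engine; ranking (step 4) is deliberately left out.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on whitespace; real engines do much more."""
    return text.lower().split()

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    postings = [index.get(term, set()) for term in tokenize(query)]
    if not postings:
        return set()
    return set.intersection(*postings)

# Hypothetical three-document collection.
docs = {
    1: "search engines index documents",
    2: "users search for documents",
    3: "ranking is hard",
}
index = build_index(docs)
results = search(index, "search documents")  # documents 1 and 2
```

The query only touches the term-to-documents map, never the document text, which is why the lookup stays fast as the collection grows.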
5. Search algorithm 1
Term-based Ranking (tf/idf)
• tf = term frequency
documents that use the query terms most are presumed to be most relevant
• idf = inverse document frequency
terms that are more rare are better indicators of relevance
• Assumptions
1) relevance can be measured with document terms
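One common way to combine the two measures is to sum tf × idf over the query terms. The sketch below assumes a length-normalized tf and a plain logarithmic idf; real engines use many variants, and the function name and sample documents are invented for illustration.

```python
import math
from collections import Counter

def tf_idf_scores(docs, query):
    """Score each (non-empty) document by summing tf * idf over the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter()
    for terms in tokenized.values():
        df.update(set(terms))
    scores = {}
    for doc_id, terms in tokenized.items():
        counts = Counter(terms)
        score = 0.0
        for term in query.lower().split():
            if df[term] == 0:
                continue  # term appears nowhere in the collection
            tf = counts[term] / len(terms)      # assumed: length-normalized tf
            idf = math.log(n_docs / df[term])   # assumed: unsmoothed log idf
            score += tf * idf
        scores[doc_id] = score
    return scores

# Hypothetical collection.
docs = {
    "a": "fridge repair fridge parts",
    "b": "fridge sale",
    "c": "garden tools sale",
}
scores = tf_idf_scores(docs, "fridge repair")
ranked = sorted(scores, key=scores.get, reverse=True)
```

Here "repair" occurs in only one document, so it contributes more to the score of document "a" than the more common "fridge" does.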
6. Search algorithm 2
PageRank (Google)
• Relevant set is still identified by term matching
• A revolution in ranking:
based on linking between documents
• Assumptions:
1) important sites link to other important sites
2) if many people link to a site, it is important
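The two assumptions can be sketched as a simplified power iteration over a link graph. This is a textbook-style approximation, not Google's production algorithm; the damping factor of 0.85, the iteration count and the toy site graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration sketch; links maps each page to the pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly across its links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page site.
links = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home"],
    "blog": ["home", "products"],
}
rank = pagerank(links)
best = max(rank, key=rank.get)
```

In this toy graph "home" ends up with the highest rank because every other page links to it, while "blog", which nothing links to, ends up lowest.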
7. Citation Analysis
• Authors carefully select articles to cite
• The more citations an article gets,
the better it must be
• Citations from authors who are themselves highly cited confer their power on those
they cite
• Aggregate and leverage all these small individual decisions...
8. How Complex is Google?
Google has about 36 ranking algorithms
Examples:
Citation Analysis
Statistical Clustering
Parsing Document Structure
Parsing Data in the Document
Microcontent Parsing
10. Evaluating Search
Recall
the percentage of all relevant documents retrieved
100% recall means every relevant document is retrieved
Precision
the percentage of documents retrieved that are relevant
100% precision means only relevant documents are retrieved
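Both measures reduce to simple set arithmetic once a set of relevant documents has been agreed. A minimal sketch, with hypothetical document ids:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, each as a fraction of 1."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgments: 4 relevant documents exist, 5 were returned.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = ["d1", "d2", "d7", "d8", "d9"]
p, r = precision_recall(retrieved, relevant)  # p = 0.4, r = 0.5
```

Two of the five returned documents are relevant (precision 40%), and two of the four relevant documents were found (recall 50%).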
11. Thoughts & Reservations about Evaluating Search
• Precision and Recall usually trade off against each other, so improving one often
reduces the other.
• Given a corpus of content like the web (tens of billions of items)...
Recall is unmeasurable, and thus essentially meaningless
• What is relevance?
• Measuring Precision depends on an agreed definition of relevance, which is
tricky (human cataloging is only about 80% ‘accurate’ - relevance is very hard
to quantify)
12. Zipf
Best Bets
• Manually selected results, tied to specific query terms or phrases
• User-driven phrases
select the most-used phrases from search traffic;
go for easy wins, because returns diminish sharply
• Business-driven phrases
select phrases important to the business;
such as product names or office locations;
or politically sensitive phrases, so you can control the message people see
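A minimal sketch of how Best Bets might be wired in: pick candidate phrases from the search log, then pin hand-selected results ahead of the engine's own ranking. The table contents, URLs, function names and the stand-in engine are all hypothetical.

```python
from collections import Counter

def normalize(query):
    """Lowercase and collapse whitespace so near-identical queries match."""
    return " ".join(query.lower().split())

def top_queries(query_log, n):
    """Return the n most frequent normalized queries from a search log."""
    counts = Counter(normalize(q) for q in query_log)
    return [q for q, _ in counts.most_common(n)]

# Hypothetical hand-curated table: normalized phrase -> pinned results.
BEST_BETS = {
    "holiday schedule": ["/hr/holidays"],
    "vpn": ["/it/vpn-setup"],
}

def search_with_best_bets(query, organic_search):
    """Put any manually selected results ahead of the engine's own ranking."""
    pinned = BEST_BETS.get(normalize(query), [])
    organic = [r for r in organic_search(query) if r not in pinned]
    return pinned + organic

def fake_engine(query):
    """Stand-in for a real search engine."""
    return ["/news/vpn-outage", "/it/vpn-setup"]

log = ["VPN", "vpn ", "holiday schedule", "vpn", "parking", "Holiday  Schedule"]
candidates = top_queries(log, 2)
results = search_with_best_bets("VPN", fake_engine)
```

Because a few phrases account for most search traffic, curating only the top handful of `candidates` covers a large share of queries.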
13. Relevance Feedback
• The user provides direct or indirect feedback on the search results
• Click tracking
• “More like this” or “Find similar”
• Clustering
14. Structured Search
• Designers use patterns in search behavior to guess the user’s intent;
this requires a substantial understanding of user behavior;
it may require structured content (although, not necessarily)
Examples
• Zip Code -> Zip Code Lookup Tool
• Person’s name -> Directory Listing
• Product Name -> Shop or Support?
• Address -> Map this?
• Topic -> Introduction, Forms, Policies or Reports?
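The first and third examples can be sketched as simple pattern checks run before the normal search. The US ZIP code pattern, the intent labels and the product list are assumptions for illustration; a real site would derive its patterns from observed search behavior.

```python
import re

# Assumed pattern: US ZIP code, with optional ZIP+4 suffix.
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")

def guess_intent(query, product_names):
    """Return a hypothetical intent label for routing the query."""
    q = query.strip()
    if ZIP_RE.match(q):
        return "zip_lookup"
    if q.lower() in product_names:
        return "product"  # the page could then offer "Shop or Support?"
    return "general_search"

# Hypothetical product catalog, stored lowercase.
products = {"widget pro", "widget mini"}
intent = guess_intent("90210", products)
```

Queries that match no pattern fall through to ordinary search, so a wrong guess costs little.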
15. Controlled Vocabularies
• Classification with a controlled vocabulary is the best way to ensure 100%
Recall
• Lead-in synonyms
enter “fridge”; get “refrigerator” instead;
best if the collection is well-cataloged;
increases precision (e.g. in a library)
• Term-expansion synonyms
enter “refrigerator”; get “fridge” too;
best if the collection is not well-cataloged;
increases recall at the cost of precision (e.g. on eBay)
• Spell check on query phrases
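The difference between the two synonym styles, plus a simple spell check, can be sketched as below. The synonym tables and vocabulary are invented for the example, and the spell check uses Python's standard `difflib.get_close_matches` as a stand-in for a real spelling corrector.

```python
import difflib

# Hypothetical tables. Lead-in: variant -> preferred term.
LEAD_IN = {"fridge": "refrigerator", "icebox": "refrigerator"}
# Term expansion: term -> additional terms to search for as well.
EXPANSIONS = {"refrigerator": {"fridge", "icebox"}}
VOCABULARY = ["refrigerator", "freezer", "dishwasher"]

def lead_in(terms):
    """Replace each variant with the preferred term (favors precision)."""
    return [LEAD_IN.get(t, t) for t in terms]

def expand(terms):
    """Add known variants alongside each term (favors recall)."""
    out = set(terms)
    for t in terms:
        out |= EXPANSIONS.get(t, set())
    return out

def spell_check(term):
    """Suggest the closest vocabulary term, or keep the term unchanged."""
    matches = difflib.get_close_matches(term, VOCABULARY, n=1)
    return matches[0] if matches else term

replaced = lead_in(["fridge", "magnet"])
expanded = expand(["refrigerator"])
corrected = spell_check("refridgerator")
```

Lead-in narrows the query to the cataloger's term, while expansion widens it to whatever terms sellers or authors happened to use.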
16. Why is search important?
IF:
About half of all users prefer to search first*
THEN:
What percentage of a content site’s development effort should be devoted to search?
* This statistic is highly context-dependent. People’s
behavior depends on the context of their actions.
The stat is from Jared Spool.