2. What is covered?
• A non-programmer's introduction to:
• Why we have search engines.
• How search works across a page, a book,
thousands of books, and millions of books.
• How to get a good search result.
3. Speaker Background
• 10+ years working on search engines
• Amazon, A9.com, Mechanical Turk,
Trusera.com
• 13 patents -- Helping people find anything
• Billions of dollars of revenue
• Millions of searches per hour
4. Have you searched a
book for your name?
• Wonder how many times your name was
mentioned in your High School yearbook?
• Find your name across all your High School
and college yearbooks?
• Which would be the “best result” if I
searched for your name in those
yearbooks?
6. Search engines not
taught before the web
• Not taught because there was no demand.
• Why no demand?
• Machines had 10-20MB of disk.
• $100 per MB of disk --> Disk quotas
• Limited networking --> Limited
information
7. What does 1 Megabyte
of space hold?
• Book Page -- 2.5 Kilobytes of text
• 1 Megabyte == 400 pages ~ 1 thick book
8. Is it worth it to store a
book?
• If disk space cost $100 per MB it had
better be worth it!
• Copying a $20 book onto $100 of disk
space is not cost-effective.
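The arithmetic behind the last two slides can be checked with a few lines of Python. This is only a sketch: the constants are the round figures from the slides, and treating 1 MB as 1000 KB is an assumption made to match them.

```python
# Figures from the slides: about 2.5 KB of text per page, $100 per MB of disk.
# Assumption: 1 MB is treated as 1000 KB, matching the slides' round numbers.
KB_PER_PAGE = 2.5
KB_PER_MB = 1000
DOLLARS_PER_MB_1984 = 100


def pages_per_megabyte(kb_per_page: float = KB_PER_PAGE) -> float:
    """How many pages of plain text fit in one megabyte."""
    return KB_PER_MB / kb_per_page


def storage_cost(pages: int, dollars_per_mb: float = DOLLARS_PER_MB_1984) -> float:
    """Disk cost in dollars for storing the given number of pages."""
    megabytes = pages * KB_PER_PAGE / KB_PER_MB
    return megabytes * dollars_per_mb


print(pages_per_megabyte())  # 400.0 pages, about one thick book
print(storage_cost(400))     # 100.0 dollars to hold a $20 book
```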
9. Why has Search grown
so quickly?
• Lots and lots of fantastically cheap disk
space!
10. Inexpensive Disk!
[Two charts, 1984-2008: cost per megabyte of disk (axis 0 to 100) and megabytes per dollar of disk (axis 0 to 10000).]
11. So what happens?
• Information blossoms.
• Quotas are gone -- Never have to delete!
• Email
• Web Pages
• Books, Image data, Music
12. Demand for search
skyrocketed
• Cheaper disks == more data to search.
• More data demands:
• Better search techniques
• Different handling of items indexed.
• Better user interfaces
• Reminder: There is no magic in search!
13. How does search
work?
• Let’s run through a text search example
14. Simple Searching
• How do you search for the word “coyness”
in the following string:
• “Had we but world enough and time thy
coyness lady would be no crime.”
15. Find the first
“c”
coyness
to his coy mistress
had we but world enough
and time thy coyness lady
16-24. [Animation frames; the transcript repeats the slide 15 text unchanged while the word “coyness” steps along the text.]
25. No match.
coyness
to his coy mistress
had we but world enough
and time thy coyness lady
26-78. [Animation frames: after the failed match, the word “coyness” keeps stepping through the text.]
79. Matched!
to his coy mistress
had we but world enough
coyness
and time thy coyness lady
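The walk-through above is the naive search: line the word up at each position in the text and compare letter by letter. A minimal Python sketch follows; the function name `naive_search` is made up for illustration, and the text is the lowercase string from the slides.

```python
def naive_search(text: str, word: str) -> int:
    """Return the index of the first occurrence of word in text, or -1."""
    n, m = len(text), len(word)
    for start in range(n - m + 1):
        matched = 0
        # Compare letter by letter until a mismatch or a full match.
        while matched < m and text[start + matched] == word[matched]:
            matched += 1
        if matched == m:
            return start
    return -1


text = "to his coy mistress had we but world enough and time thy coyness lady"
print(naive_search(text, "coyness"))  # 57
```

The partial match on "coy" in the title fails on the fourth letter, so the search moves on one position and keeps going.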
80. Can we find it faster?
• Yes!
• Boyer-Moore-Horspool.
• Start comparing from the end of the word.
• If the text character is not in the word,
skip past it.
• If the text character is a letter in the word,
shift forward to line the word up with it.
81. coyness
to his coy mistress
had we but world enough
and time thy coyness lady
83. No match, skip.
coyness
to his coy mistress
had we but world enough
and time thy coyness lady
84-88. [Animation frames: the word “coyness” skips ahead through the text.]
89. Doesn’t match but C is
a letter in our word
to his coy mistress
had we but world enough
coyness
and time thy coyness lady
90. Jump 7 spaces
to his coy mistress
had we but world enough
coyness
and time thy coyness lady
91-94. [Animation frames following the jump.]
95. Matched!
to his coy mistress
had we but world enough
coyness
and time thy coyness lady
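A sketch of Boyer-Moore-Horspool in Python, using one common textbook formulation. The function name `horspool_search` is hypothetical, and the exact shift distances may differ from the ones drawn in the slide animation.

```python
def horspool_search(text: str, word: str) -> int:
    """Return the index of the first occurrence of word in text, or -1."""
    n, m = len(text), len(word)
    if m == 0:
        return 0
    # Shift table: for each letter of the word except the last, how far its
    # rightmost occurrence is from the end of the word.
    shift = {ch: m - 1 - i for i, ch in enumerate(word[:-1])}
    pos = 0
    while pos <= n - m:
        i = m - 1
        # Compare from the end of the word backwards.
        while i >= 0 and text[pos + i] == word[i]:
            i -= 1
        if i < 0:
            return pos
        # Look at the text character under the word's last letter. If it is
        # not in the table, skip the whole word length.
        pos += shift.get(text[pos + m - 1], m)
    return -1


text = "to his coy mistress had we but world enough and time thy coyness lady"
print(horspool_search(text, "coyness"))  # 57
```

Building the shift table is the "little extra overhead"; in return, most positions in the text are never examined.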
96. Simple Search works!
• Naive algorithm can work quickly for
documents you have never seen before and
don’t want to bother keeping around.
• Boyer-Moore-Horspool works even faster,
with a little extra overhead for building a
table.
• But, what if I have extra disk space to store
a book and want to go even faster?
97. Build an index!
Image by Dan Taylor: http://www.flickr.com/photos/dantaylor/1145628275/
98. Indexes are not new
• Indexes created in the 10th century to find
words in books.
• Card catalogs in libraries provide indexes
to books.
• What is new is how much information can
be stored in a single place.
99. Indexing is simple
• For each word in a book
• Store which page in the book it is on.
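The two steps above fit in a few lines of Python. This is a minimal sketch: the three sample pages are made up for illustration, and splitting words with a simple regular expression is an assumption, not how any particular search engine tokenizes text.

```python
import re
from collections import defaultdict


def build_index(pages: dict[int, str]) -> dict[str, list[int]]:
    """Map each word to the sorted list of page numbers it appears on."""
    index = defaultdict(set)
    for page_number, text in pages.items():
        for word in re.findall(r"[a-z']+", text.lower()):
            index[word].add(page_number)
    return {word: sorted(numbers) for word, numbers in index.items()}


# Hypothetical three-page "book", for illustration only.
pages = {
    1: "The cat sat in the hat.",
    2: "A hat is not a cat.",
    3: "Thy coyness, lady, were no crime.",
}
index = build_index(pages)
print(index["cat"])      # [1, 2]
print(index["coyness"])  # [3]
```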
100. Partial index
• a -- 1,2,3,4,5,6,7,8,9,10,....
• cat -- 20, 45, 56, 58, 93, 84, 85
• coyness -- 70, 152, 425
• hat -- 6, 10, 35, 58, 89,105
• in -- 1,2,3,4,5,6,7,8,9,10,....58,......
• the -- 1,2,3,4,5,6,7,8,9,10,....58,....
101. Indexes use more disk
space
• A complete index takes about 33% as much
space as the text indexed.
• In 1984, that would be $133 in disk space
per book (text plus index).
102. • In 2008, $133 can store and index 1
million books.
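A quick check of these figures, as a sketch. It assumes one book is 1 MB of text (from the earlier slide) and that the 33% index overhead is added on top of the text; the implied 2008 price is derived from the slide's own numbers, not from a price list.

```python
MB_PER_BOOK = 1.0      # one thick book of plain text, per the earlier slide
INDEX_OVERHEAD = 0.33  # index size as a fraction of the text size


def cost_per_book(dollars_per_mb: float) -> float:
    """Disk cost of storing one book plus its index."""
    return MB_PER_BOOK * (1 + INDEX_OVERHEAD) * dollars_per_mb


def implied_price_per_mb(total_dollars: float, books: int) -> float:
    """Price per MB implied by storing and indexing this many books."""
    return total_dollars / (books * MB_PER_BOOK * (1 + INDEX_OVERHEAD))


print(round(cost_per_book(100), 2))                      # 1984, at $100 per MB
print(round(1 / implied_price_per_mb(133, 1_000_000)))   # MB per dollar in 2008
```

The second figure comes out at about 10,000 MB per dollar, which agrees with the top of the scale on the disk price chart.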
103. How do you search
with an index?
• Step 1: Pick the words you are looking for
from the index.
• Step 2: Return all the pages that the word
appears on.
104. Search for “coyness”
• a -- 1,2,3,4,5,6,7,8,9,10,....
• cat -- 20, 45, 56, 58, 93, 84, 85
• coyness -- 70, 152, 425
• hat -- 6, 10, 35, 58, 89,105
• in -- 1,2,3,4,5,6,7,8,9,10,....58,......
• the -- 1,2,3,4,5,6,7,8,9,10,....58,....
107. Search for “Cat in the
Hat”
• a -- 1,2,3,4,5,6,7,8,9,10,....
• cat -- 20, 45, 56, 58, 93, 84, 85
• coyness -- 70, 152, 425
• hat -- 6, 10, 35, 58, 89,105
• in -- 1,2,3,4,5,6,7,8,9,10,....58,......
• the -- 1,2,3,4,5,6,7,8,9,10,....58,....
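Searching for several words means looking each one up and keeping only the pages they share. The sketch below uses the page numbers shown on the slide; for "in" and "the" only the numbers actually printed are included, and the elided ones are left out. The function name `search_all` is made up for illustration.

```python
# Page lists copied from the slide. The slide elides part of the lists for
# "in" and "the"; only the numbers actually shown are used here.
index = {
    "cat": [20, 45, 56, 58, 93, 84, 85],
    "coyness": [70, 152, 425],
    "hat": [6, 10, 35, 58, 89, 105],
    "in": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 58],
    "the": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 58],
}


def search_all(index: dict[str, list[int]], query: str) -> list[int]:
    """Return the pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return []
    page_sets = [set(index.get(word, ())) for word in words]
    return sorted(set.intersection(*page_sets))


print(search_all(index, "coyness"))         # [70, 152, 425]
print(search_all(index, "Cat in the Hat"))  # [58]
```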
109. Phrase Search for “Cat
in the Hat”
• a -- 1(3, 12, 15, 18), 2(12, 54, 56), ....
• cat -- 20(45), 56(5), 58(3), 93(23), ....
• coyness -- 70(56, 82), 152(45), 425(12)
• hat -- 6, 10, 35, 58(6), 89, 105
• in -- 1,2,3,4,5,6,7,8,9,10,....58(4),......
• the -- 1,2,3,4,5,6,7,8,9,10,....58(5),....
Format: page(positions on that page). Added page position in ()
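With positions stored, a phrase matches only where the words sit next to each other in order. A minimal sketch, using just the entries on the slide that carry positions; the function name `phrase_search` and the nested-dictionary layout are assumptions for illustration.

```python
# word -> {page: [positions on that page]}, limited to the entries on the
# slide that include positions.
positional_index = {
    "cat": {20: [45], 56: [5], 58: [3], 93: [23]},
    "in": {58: [4]},
    "the": {58: [5]},
    "hat": {58: [6]},
}


def phrase_search(index: dict, phrase: str) -> list[int]:
    """Return pages where the words of the phrase appear consecutively."""
    words = phrase.lower().split()
    if not words or any(word not in index for word in words):
        return []
    hits = []
    for page, starts in index[words[0]].items():
        for start in starts:
            # Word number k of the phrase must sit at position start + k.
            if all(start + offset in index[word].get(page, ())
                   for offset, word in enumerate(words)):
                hits.append(page)
                break
    return sorted(hits)


print(phrase_search(positional_index, "Cat in the Hat"))  # [58]
print(phrase_search(positional_index, "cat hat"))         # []
```

On page 58 the four words sit at positions 3, 4, 5 and 6, so the phrase matches there and nowhere else.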
110. How about Searching
1000s of books?
• Leverage the same tools we used before
• Create an index over multiple books
• Perform a search returning books and
pages
111. Multiple books for “Cat
in the Hat”
• cat -- [Dr. Seuss] 20, 45, 56, 58, [Pet Health
Dictionary] 5, 25, 68
• hat -- [Harry Potter] 6, 92, [Dr. Seuss] 35,
58, 89,105
• in -- [Twilight] 1,2,... [Dr. Seuss] 1,2,3,...58,...
• the -- [Programming Perl] 1,2,3,4,5, .... [Dr.
Seuss]...58,....
Added Book titles in []
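The same idea in code: each entry now records the book as well as the page, and a search returns (book, page) pairs. This sketch uses only the page numbers printed on the slide, with the elided ones left out; `search_books` is a hypothetical name.

```python
# word -> {book title: [pages]}, using only the numbers shown on the slide.
book_index = {
    "cat": {"Dr. Seuss": [20, 45, 56, 58], "Pet Health Dictionary": [5, 25, 68]},
    "hat": {"Harry Potter": [6, 92], "Dr. Seuss": [35, 58, 89, 105]},
    "in": {"Twilight": [1, 2], "Dr. Seuss": [1, 2, 3, 58]},
    "the": {"Programming Perl": [1, 2, 3, 4, 5], "Dr. Seuss": [58]},
}


def search_books(index: dict, query: str) -> list[tuple[str, int]]:
    """Return (book, page) pairs that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return []
    postings = []
    for word in words:
        books = index.get(word, {})
        postings.append({(book, page)
                         for book, pages in books.items()
                         for page in pages})
    return sorted(set.intersection(*postings))


print(search_books(book_index, "Cat in the Hat"))  # [('Dr. Seuss', 58)]
```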
112. How do you search
Millions of Books?
• Similar to finding all the Aces in a deck of
cards.
• 1 person -- 30 seconds if deck is
unsorted
• 1 person -- 3 seconds if deck is sorted
• 26 people -- 1 second if each has 2 cards.
113. How do you search
Millions of Books?
[Diagram: Website; Search Service; Query Collector; seven Index Servers; Book Indexes. Caption: “Search across many machines and return best results.”]
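The layout above can be imitated on one machine with threads standing in for servers. This is a sketch under stated assumptions: the `IndexServer` class, the `query_collector` function and the sample data are all hypothetical, and merging here just sorts the hits rather than ranking the best ones.

```python
from concurrent.futures import ThreadPoolExecutor


class IndexServer:
    """Stand-in for one machine holding the index for a slice of the books."""

    def __init__(self, index):
        self.index = index  # word -> {book title: [pages]}

    def search(self, word):
        books = self.index.get(word.lower(), {})
        return [(book, page) for book, pages in books.items() for page in pages]


def query_collector(servers, word):
    """Ask every index server in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=max(1, len(servers))) as pool:
        partial_results = list(pool.map(lambda server: server.search(word), servers))
    return sorted(hit for hits in partial_results for hit in hits)


# Hypothetical split of the books across three servers.
servers = [
    IndexServer({"cat": {"Dr. Seuss": [20, 45, 56, 58]}}),
    IndexServer({"cat": {"Pet Health Dictionary": [5, 25, 68]}}),
    IndexServer({"hat": {"Harry Potter": [6, 92]}}),
]
print(query_collector(servers, "cat"))
```

Like the deck of cards split among 26 people, each server only looks through its own small share, so the whole search takes about as long as the slowest single server.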
114. Millions of books to
millions of customers.
[Diagram: Website; Search Service; three Query Collectors; 24 Index Servers.]
115. Which is the best
result?
• Should a search for “cat in the hat” return:
• The book by Dr. Seuss,
• A book about all the Dr. Seuss books,
• A story where a mother reads the
story to her child?
• Did you get what the customer wanted?
116. Relevancy (It depends)
• TF/IDF -- Gives rare query words more weight
than common words when scoring results.
• Amazon -- Biases towards what people
are searching and buying recently.
• Google -- Biases towards user activity,
PageRank, and other factors.
• Depends on what the customer intends
and how they ask the question.
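To make the TF/IDF bullet concrete, here is a minimal sketch of one common variant (term frequency times the log of inverse document frequency, with no smoothing). The three tiny documents are made up, and real engines such as Amazon or Google combine many more signals than this.

```python
import math
from collections import Counter


def tf_idf_scores(query: str, documents: dict[str, str]) -> dict[str, float]:
    """Score each document for the query; rare words count for more."""
    tokenized = {name: text.lower().split() for name, text in documents.items()}
    n_docs = len(tokenized)
    scores = {}
    for name, words in tokenized.items():
        counts = Counter(words)
        score = 0.0
        for term in query.lower().split():
            # Document frequency: how many documents contain the term.
            df = sum(1 for other in tokenized.values() if term in other)
            if df == 0 or not words:
                continue
            tf = counts[term] / len(words)
            idf = math.log(n_docs / df)  # 0 when every document has the term
            score += tf * idf
        scores[name] = score
    return scores


# Hypothetical documents, for illustration only.
documents = {
    "seuss": "the cat in the hat sat on the mat",
    "pets": "the dog and the bird in the yard",
    "poem": "the coyness of the lady",
}
scores = tf_idf_scores("the cat", documents)
print(max(scores, key=scores.get))  # seuss
```

"the" appears in every document, so it contributes nothing; the rare word "cat" decides the ranking.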
117. Last step: Get the text
snippet.
• You have searched across millions of books,
• You have found the “Best” books with the
words “cat in the hat”
• You have spent 50 ms across hundreds of
machines to get the right result.
• How do you find the “snippet” on the
page?
119. Get snippet using
simple search
• Fetch the book page from a different disk.
• Use a simple linear search like the naive search or
Boyer-Moore-Horspool to get the snippet and
surrounding text.
• Simple techniques applied across more
machines.
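A sketch of the snippet step. Python's built-in `str.find` stands in for the simple linear search here, the `snippet` function and its `context` parameter are hypothetical, and the sample page text is made up.

```python
def snippet(page_text: str, phrase: str, context: int = 30):
    """Return the phrase with some surrounding text, or None if not found."""
    # str.find is the stand-in for a naive or Boyer-Moore-Horspool search.
    pos = page_text.lower().find(phrase.lower())
    if pos == -1:
        return None
    start = max(0, pos - context)
    end = min(len(page_text), pos + len(phrase) + context)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(page_text) else ""
    return prefix + page_text[start:end] + suffix


# Hypothetical page text, for illustration only.
page = "Every night the mother read The Cat in the Hat to her child before bed."
print(snippet(page, "cat in the hat", context=10))
```

Only the one page that the index pointed to has to be scanned, so a simple search is fast enough.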
120. Future Trends
• Disk space costs dropping --> More data
• More networked devices --> More sharing
• What would you do with:
• All the web on your cell phone
• All your family/friends instantly available
121. Just scratching the
surface.
• Lucene search engine -- Open source. How
to index and search results.
http://lucene.apache.org/
• Google -- Presentations and research notes.
http://research.google.com/video.html
• http://www.searchenginehistory.com/