SlideShare ist ein Scribd-Unternehmen logo
1 von 22
Downloaden Sie, um offline zu lesen
being google
 tom dyson
V.
metadata is easy
language is hard
Our Corpus:
 1. The cow says moo.
2. The sheep says baa.
3. The dogs say woof.
 4. The dog-cow says
         moof.
>>>   doc1   =   quot;The   cow says moo.quot;
>>>   doc2   =   quot;The   sheep says baa.quot;
>>>   doc3   =   quot;The   dogs say woof.quot;
>>>   doc4   =   quot;The   dog-cow says moof.quot;
Brute force
>>> docs = [doc1, doc2, doc3, doc4]

>>> def searcher(term):
...   for doc in docs:
...     if doc.find(term) > -1:
...       print quot;found '%s' in '%s'quot; % (term, doc)
...
>>> searcher('moo')
found 'moo' in 'The cow says moo.'
my first inverted index
Tokenising #1
>>> doc1.split()
['The', 'cow', 'says', 'moo.']
Tokenising #2
>>> import re
>>> word = re.compile('W+')
>>> word.split(doc1)
['The', 'cow', 'says', 'moo', '']

>>> doc4 = quot;The dog-cow says moofquot;
>>> word.split(doc4)
['The', 'dog', 'cow', 'says', 'moof']
Tokenising #3

>>> word = re.compile('s|[^a-z-]', re.I)
>>> word.split(doc4)
['The', 'dog-cow', 'says', 'moof', '']
Data structures
>>>   doc1   =   {'name':'doc   1',   'content':quot;The   cow says moo.quot;}
>>>   doc2   =   {'name':'doc   2',   'content':quot;The   sheep says baa.quot;}
>>>   doc3   =   {'name':'doc   3',   'content':quot;The   dogs say woof.quot;}
>>>   doc4   =   {'name':'doc   4',   'content':quot;The   dog-cow says moof.quot;}
Postings
>>> postings = {}

>>> for doc in docs:
...   for token in word.split(doc['content']):
...     if len(token) == 0: break
...     doc_name = doc['name']
...     if token not in postings:
...       postings[token.lower()] = [doc_name]
...     else:
...       postings[token.lower()].append(doc_name)
Postings
>>> postings
{'sheep': ['doc 2'], 'says': ['doc 1', 'doc 2',
'doc 4'], 'cow': ['doc 1'], 'moof': ['doc 4'],
'dog-cow': ['doc 4'], 'woof': ['doc 3'], 'say':
['doc 3'], 'moo': ['doc 1'], 'baa': ['doc 2'],
'The': ['doc 1', 'doc 2', 'doc 3', 'doc 4'],
'dogs': ['doc 3']}
O(log n)
>>> def searcher(term):
...   if term in postings:
...     for match in postings[term]:
...       print quot;found '%s' in '%s'quot; % (term, match)
...
>>> searcher('says')
found 'says' in 'doc 1'
found 'says' in 'doc 2'
found 'says' in 'doc 4'
More postings
‘sheep’: [‘doc 2’, [2]]
‘says’: [‘doc 1’, [3], ‘doc 2’, [3], ‘doc 4’, [3]]
and more postings
‘sheep’: [‘doc 2’, [‘field’: ‘body’], 2]]
‘google’: [‘intro’, [‘field’: ‘title’], 2]]
tokenising #3
  Punctuation
   Stemming
   Stop words
 Parts of Speech
Entity Extraction
     Markup
Logistics
          Storage
(serialising, transporting,
        clustering)
          Updates
       Warming up
ranking
   Density
    (tf–idf)
   Position
      Date
Relationships
  Feedback
   Editorial
interesting search
      Lucene
(Hadoop, Solr, Nutch)
  OpenFTS / MySQL
       Sphinx
   Hyper Estraier
      Xapian
  Other index types
being google
 tom dyson

Weitere ähnliche Inhalte

Was ist angesagt?

MongoDB: How it Works
MongoDB: How it WorksMongoDB: How it Works
MongoDB: How it WorksMike Dirolf
 
Intro to MongoDB and datamodeling
Intro to MongoDB and datamodeling Intro to MongoDB and datamodeling
Intro to MongoDB and datamodeling rogerbodamer
 
Groovy ネタ NGK 忘年会2009 ライトニングトーク
Groovy ネタ NGK 忘年会2009 ライトニングトークGroovy ネタ NGK 忘年会2009 ライトニングトーク
Groovy ネタ NGK 忘年会2009 ライトニングトークTsuyoshi Yamamoto
 
RESTing with the new Yandex.Disk API, Clemens Аuer
RESTing with the new Yandex.Disk API, Clemens АuerRESTing with the new Yandex.Disk API, Clemens Аuer
RESTing with the new Yandex.Disk API, Clemens АuerYandex
 
Ruby Language: Array, Hash and Iterators
Ruby Language: Array, Hash and IteratorsRuby Language: Array, Hash and Iterators
Ruby Language: Array, Hash and IteratorsSarah Allen
 
MongoDB Europe 2016 - ETL for Pros – Getting Data Into MongoDB The Right Way
MongoDB Europe 2016 - ETL for Pros – Getting Data Into MongoDB The Right WayMongoDB Europe 2016 - ETL for Pros – Getting Data Into MongoDB The Right Way
MongoDB Europe 2016 - ETL for Pros – Getting Data Into MongoDB The Right WayMongoDB
 
Presentasi mac'lc-02
Presentasi mac'lc-02Presentasi mac'lc-02
Presentasi mac'lc-02maman__
 
Presentasi Mac'LC
Presentasi Mac'LCPresentasi Mac'LC
Presentasi Mac'LCmaman__
 
コミュニケーションとしてのコード
コミュニケーションとしてのコードコミュニケーションとしてのコード
コミュニケーションとしてのコードAtsushi Shibata
 
CouchDB @ red dirt ruby conference
CouchDB @ red dirt ruby conferenceCouchDB @ red dirt ruby conference
CouchDB @ red dirt ruby conferenceleinweber
 
PuppetCamp SEA @ Blk 71 - Nagios in under 10 mins with Puppet
PuppetCamp SEA @ Blk 71 -  Nagios in under 10 mins with PuppetPuppetCamp SEA @ Blk 71 -  Nagios in under 10 mins with Puppet
PuppetCamp SEA @ Blk 71 - Nagios in under 10 mins with PuppetOlinData
 
PuppetCamp SEA @ Blk 71 - Nagios in under 10 mins with Puppet
PuppetCamp SEA @ Blk 71 -  Nagios in under 10 mins with PuppetPuppetCamp SEA @ Blk 71 -  Nagios in under 10 mins with Puppet
PuppetCamp SEA @ Blk 71 - Nagios in under 10 mins with PuppetWalter Heck
 
Dev Jumpstart: Build Your First App with MongoDB
Dev Jumpstart: Build Your First App with MongoDBDev Jumpstart: Build Your First App with MongoDB
Dev Jumpstart: Build Your First App with MongoDBMongoDB
 
Percona Live 4/15/15: Transparent sharding database virtualization engine (DVE)
Percona Live 4/15/15: Transparent sharding database virtualization engine (DVE)Percona Live 4/15/15: Transparent sharding database virtualization engine (DVE)
Percona Live 4/15/15: Transparent sharding database virtualization engine (DVE)Tesora
 

Was ist angesagt? (20)

Javantura v2 - Replication with MongoDB - what could go wrong... - Philipp Krenn
Javantura v2 - Replication with MongoDB - what could go wrong... - Philipp KrennJavantura v2 - Replication with MongoDB - what could go wrong... - Philipp Krenn
Javantura v2 - Replication with MongoDB - what could go wrong... - Philipp Krenn
 
MongoDB: How it Works
MongoDB: How it WorksMongoDB: How it Works
MongoDB: How it Works
 
Intro to MongoDB and datamodeling
Intro to MongoDB and datamodeling Intro to MongoDB and datamodeling
Intro to MongoDB and datamodeling
 
Groovy ネタ NGK 忘年会2009 ライトニングトーク
Groovy ネタ NGK 忘年会2009 ライトニングトークGroovy ネタ NGK 忘年会2009 ライトニングトーク
Groovy ネタ NGK 忘年会2009 ライトニングトーク
 
RESTing with the new Yandex.Disk API, Clemens Аuer
RESTing with the new Yandex.Disk API, Clemens АuerRESTing with the new Yandex.Disk API, Clemens Аuer
RESTing with the new Yandex.Disk API, Clemens Аuer
 
Ruby Language: Array, Hash and Iterators
Ruby Language: Array, Hash and IteratorsRuby Language: Array, Hash and Iterators
Ruby Language: Array, Hash and Iterators
 
MongoDB Europe 2016 - ETL for Pros – Getting Data Into MongoDB The Right Way
MongoDB Europe 2016 - ETL for Pros – Getting Data Into MongoDB The Right WayMongoDB Europe 2016 - ETL for Pros – Getting Data Into MongoDB The Right Way
MongoDB Europe 2016 - ETL for Pros – Getting Data Into MongoDB The Right Way
 
Presentasi mac'lc-02
Presentasi mac'lc-02Presentasi mac'lc-02
Presentasi mac'lc-02
 
Presentasi Mac'LC
Presentasi Mac'LCPresentasi Mac'LC
Presentasi Mac'LC
 
My_sql_with_php
My_sql_with_phpMy_sql_with_php
My_sql_with_php
 
はじめてのGroovy
はじめてのGroovyはじめてのGroovy
はじめてのGroovy
 
コミュニケーションとしてのコード
コミュニケーションとしてのコードコミュニケーションとしてのコード
コミュニケーションとしてのコード
 
Mongo db
Mongo dbMongo db
Mongo db
 
Codigos
CodigosCodigos
Codigos
 
CouchDB @ red dirt ruby conference
CouchDB @ red dirt ruby conferenceCouchDB @ red dirt ruby conference
CouchDB @ red dirt ruby conference
 
PuppetCamp SEA @ Blk 71 - Nagios in under 10 mins with Puppet
PuppetCamp SEA @ Blk 71 -  Nagios in under 10 mins with PuppetPuppetCamp SEA @ Blk 71 -  Nagios in under 10 mins with Puppet
PuppetCamp SEA @ Blk 71 - Nagios in under 10 mins with Puppet
 
PuppetCamp SEA @ Blk 71 - Nagios in under 10 mins with Puppet
PuppetCamp SEA @ Blk 71 -  Nagios in under 10 mins with PuppetPuppetCamp SEA @ Blk 71 -  Nagios in under 10 mins with Puppet
PuppetCamp SEA @ Blk 71 - Nagios in under 10 mins with Puppet
 
Dev Jumpstart: Build Your First App with MongoDB
Dev Jumpstart: Build Your First App with MongoDBDev Jumpstart: Build Your First App with MongoDB
Dev Jumpstart: Build Your First App with MongoDB
 
Percona Live 4/15/15: Transparent sharding database virtualization engine (DVE)
Percona Live 4/15/15: Transparent sharding database virtualization engine (DVE)Percona Live 4/15/15: Transparent sharding database virtualization engine (DVE)
Percona Live 4/15/15: Transparent sharding database virtualization engine (DVE)
 
Underscore
UnderscoreUnderscore
Underscore
 

Andere mochten auch

Personal Carbon Rationing
Personal Carbon RationingPersonal Carbon Rationing
Personal Carbon RationingTom Dyson
 
Wagtail - Pourquoi un nouveau CMS?
Wagtail - Pourquoi un nouveau CMS?Wagtail - Pourquoi un nouveau CMS?
Wagtail - Pourquoi un nouveau CMS?Tom Dyson
 
Hands on with Google App Engine
Hands on with Google App EngineHands on with Google App Engine
Hands on with Google App EngineTom Dyson
 
Dynamic Demand
Dynamic DemandDynamic Demand
Dynamic DemandTom Dyson
 
The mobile web
The mobile webThe mobile web
The mobile webTom Dyson
 
Wychwood CRAG launch
Wychwood CRAG launchWychwood CRAG launch
Wychwood CRAG launchTom Dyson
 
The Making of The Carbon Account
The Making of The Carbon AccountThe Making of The Carbon Account
The Making of The Carbon AccountTom Dyson
 

Andere mochten auch (7)

Personal Carbon Rationing
Personal Carbon RationingPersonal Carbon Rationing
Personal Carbon Rationing
 
Wagtail - Pourquoi un nouveau CMS?
Wagtail - Pourquoi un nouveau CMS?Wagtail - Pourquoi un nouveau CMS?
Wagtail - Pourquoi un nouveau CMS?
 
Hands on with Google App Engine
Hands on with Google App EngineHands on with Google App Engine
Hands on with Google App Engine
 
Dynamic Demand
Dynamic DemandDynamic Demand
Dynamic Demand
 
The mobile web
The mobile webThe mobile web
The mobile web
 
Wychwood CRAG launch
Wychwood CRAG launchWychwood CRAG launch
Wychwood CRAG launch
 
The Making of The Carbon Account
The Making of The Carbon AccountThe Making of The Carbon Account
The Making of The Carbon Account
 

Ähnlich wie Being Google

What I learned from Seven Languages in Seven Weeks (IPRUG)
What I learned from Seven Languages in Seven Weeks (IPRUG)What I learned from Seven Languages in Seven Weeks (IPRUG)
What I learned from Seven Languages in Seven Weeks (IPRUG)Kerry Buckley
 
Basics of Python programming (part 2)
Basics of Python programming (part 2)Basics of Python programming (part 2)
Basics of Python programming (part 2)Pedro Rodrigues
 
python chapter 1
python chapter 1python chapter 1
python chapter 1Raghu nath
 
Python chapter 2
Python chapter 2Python chapter 2
Python chapter 2Raghu nath
 
GE8151 Problem Solving and Python Programming
GE8151 Problem Solving and Python ProgrammingGE8151 Problem Solving and Python Programming
GE8151 Problem Solving and Python ProgrammingMuthu Vinayagam
 
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of Wrangling
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of WranglingPLOTCON NYC: Behind Every Great Plot There's a Great Deal of Wrangling
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of WranglingPlotly
 
Learn 90% of Python in 90 Minutes
Learn 90% of Python in 90 MinutesLearn 90% of Python in 90 Minutes
Learn 90% of Python in 90 MinutesMatt Harrison
 
Pre-Bootcamp introduction to Elixir
Pre-Bootcamp introduction to ElixirPre-Bootcamp introduction to Elixir
Pre-Bootcamp introduction to ElixirPaweł Dawczak
 
Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...
Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...
Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...amit kuraria
 
Python tutorial
Python tutorialPython tutorial
Python tutorialnazzf
 
Python tutorial
Python tutorialPython tutorial
Python tutorialShani729
 
Coming Out Of Your Shell - A Comparison of *Nix Shells
Coming Out Of Your Shell - A Comparison of *Nix ShellsComing Out Of Your Shell - A Comparison of *Nix Shells
Coming Out Of Your Shell - A Comparison of *Nix ShellsKel Cecil
 
Pyconie 2012
Pyconie 2012Pyconie 2012
Pyconie 2012Yaqi Zhao
 
How can I make it so my code works so my command line can look like -p (1).docx
How can I make it so my code works so my command line can look like -p (1).docxHow can I make it so my code works so my command line can look like -p (1).docx
How can I make it so my code works so my command line can look like -p (1).docxPaulntmMilleri
 
Desarrollando aplicaciones web en minutos
Desarrollando aplicaciones web en minutosDesarrollando aplicaciones web en minutos
Desarrollando aplicaciones web en minutosEdgar Suarez
 
Round PEG, Round Hole - Parsing Functionally
Round PEG, Round Hole - Parsing FunctionallyRound PEG, Round Hole - Parsing Functionally
Round PEG, Round Hole - Parsing FunctionallySean Cribbs
 
Python tutorial
Python tutorialPython tutorial
Python tutorialRajiv Risi
 

Ähnlich wie Being Google (20)

What I learned from Seven Languages in Seven Weeks (IPRUG)
What I learned from Seven Languages in Seven Weeks (IPRUG)What I learned from Seven Languages in Seven Weeks (IPRUG)
What I learned from Seven Languages in Seven Weeks (IPRUG)
 
Five
FiveFive
Five
 
Basics of Python programming (part 2)
Basics of Python programming (part 2)Basics of Python programming (part 2)
Basics of Python programming (part 2)
 
My First Ruby
My First RubyMy First Ruby
My First Ruby
 
python chapter 1
python chapter 1python chapter 1
python chapter 1
 
Python chapter 2
Python chapter 2Python chapter 2
Python chapter 2
 
GE8151 Problem Solving and Python Programming
GE8151 Problem Solving and Python ProgrammingGE8151 Problem Solving and Python Programming
GE8151 Problem Solving and Python Programming
 
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of Wrangling
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of WranglingPLOTCON NYC: Behind Every Great Plot There's a Great Deal of Wrangling
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of Wrangling
 
Learn 90% of Python in 90 Minutes
Learn 90% of Python in 90 MinutesLearn 90% of Python in 90 Minutes
Learn 90% of Python in 90 Minutes
 
Pre-Bootcamp introduction to Elixir
Pre-Bootcamp introduction to ElixirPre-Bootcamp introduction to Elixir
Pre-Bootcamp introduction to Elixir
 
Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...
Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...
Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...
 
Python tutorial
Python tutorialPython tutorial
Python tutorial
 
Python tutorial
Python tutorialPython tutorial
Python tutorial
 
Coming Out Of Your Shell - A Comparison of *Nix Shells
Coming Out Of Your Shell - A Comparison of *Nix ShellsComing Out Of Your Shell - A Comparison of *Nix Shells
Coming Out Of Your Shell - A Comparison of *Nix Shells
 
Pyconie 2012
Pyconie 2012Pyconie 2012
Pyconie 2012
 
7li7w devcon5
7li7w devcon57li7w devcon5
7li7w devcon5
 
How can I make it so my code works so my command line can look like -p (1).docx
How can I make it so my code works so my command line can look like -p (1).docxHow can I make it so my code works so my command line can look like -p (1).docx
How can I make it so my code works so my command line can look like -p (1).docx
 
Desarrollando aplicaciones web en minutos
Desarrollando aplicaciones web en minutosDesarrollando aplicaciones web en minutos
Desarrollando aplicaciones web en minutos
 
Round PEG, Round Hole - Parsing Functionally
Round PEG, Round Hole - Parsing FunctionallyRound PEG, Round Hole - Parsing Functionally
Round PEG, Round Hole - Parsing Functionally
 
Python tutorial
Python tutorialPython tutorial
Python tutorial
 

Kürzlich hochgeladen

DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 

Kürzlich hochgeladen (20)

DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 

Being Google

  • 2. V.
  • 5. Our Corpus: 1. The cow says moo. 2. The sheep says baa. 3. The dogs say woof. 4. The dog-cow says moof.
  • 6. >>> doc1 = quot;The cow says moo.quot; >>> doc2 = quot;The sheep says baa.quot; >>> doc3 = quot;The dogs say woof.quot; >>> doc4 = quot;The dog-cow says moof.quot;
  • 7. Brute force >>> docs = [doc1, doc2, doc3, doc4] >>> def searcher(term): ... for doc in docs: ... if doc.find(term) > -1: ... print quot;found '%s' in '%s'quot; % (term, doc) ... >>> searcher('moo') found 'moo' in 'The cow says moo.'
  • 10. Tokenising #2 >>> import re >>> word = re.compile('W+') >>> word.split(doc1) ['The', 'cow', 'says', 'moo', ''] >>> doc4 = quot;The dog-cow says moofquot; >>> word.split(doc4) ['The', 'dog', 'cow', 'says', 'moof']
  • 11. Tokenising #3 >>> word = re.compile('s|[^a-z-]', re.I) >>> word.split(doc4) ['The', 'dog-cow', 'says', 'moof', '']
  • 12. Data structures >>> doc1 = {'name':'doc 1', 'content':quot;The cow says moo.quot;} >>> doc2 = {'name':'doc 2', 'content':quot;The sheep says baa.quot;} >>> doc3 = {'name':'doc 3', 'content':quot;The dogs say woof.quot;} >>> doc4 = {'name':'doc 4', 'content':quot;The dog-cow says moof.quot;}
  • 13. Postings >>> postings = {} >>> for doc in docs: ... for token in word.split(doc['content']): ... if len(token) == 0: break ... doc_name = doc['name'] ... if token not in postings: ... postings[token.lower()] = [doc_name] ... else: ... postings[token.lower()].append(doc_name)
  • 14. Postings >>> postings {'sheep': ['doc 2'], 'says': ['doc 1', 'doc 2', 'doc 4'], 'cow': ['doc 1'], 'moof': ['doc 4'], 'dog-cow': ['doc 4'], 'woof': ['doc 3'], 'say': ['doc 3'], 'moo': ['doc 1'], 'baa': ['doc 2'], 'The': ['doc 1', 'doc 2', 'doc 3', 'doc 4'], 'dogs': ['doc 3']}
  • 15. O(log n) >>> def searcher(term): ... if term in postings: ... for match in postings[term]: ... print quot;found '%s' in '%s'quot; % (term, match) ... >>> searcher('says') found 'says' in 'doc 1' found 'says' in 'doc 2' found 'says' in 'doc 4'
  • 16. More postings ‘sheep’: [‘doc 2’, [2]] ‘says’: [‘doc 1’, [3], ‘doc 2’, [3], ‘doc 4’, [3]]
  • 17. and more postings ‘sheep’: [‘doc 2’, [‘field’: ‘body’], 2]] ‘google’: [‘intro’, [‘field’: ‘title’], 2]]
  • 18. tokenising #3 Punctuation Stemming Stop words Parts of Speech Entity Extraction Markup
  • 19. Logistics Storage (serialising, transporting, clustering) Updates Warming up
  • 20. ranking Density (tf–idf) Position Date Relationships Feedback Editorial
  • 21. interesting search Lucene (Hadoop, Solr, Nutch) OpenFTS / MySQL Sphinx Hyper Estraier Xapian Other index types