PO Department
PEOPLE OPERATION’S
MONTHLY UPDATE
09/2019
1
CPU- and memory-efficient
spellchecker implementation at Tiki
2
Results for “iphone”
3
Results for “ipohne” without spellchecker
4
Results for “ipohne” with spellchecker
5
General approach
words, result = tokenize(query), []
for w in words:
    candidates = generate_candidates(w)
    # Fall back to the original word if no candidate scores above zero
    best_c, best_score = w, 0.0
    for c in candidates:
        score = spellchecker_score(w, c)
        if score > best_score:
            best_c, best_score = c, score
    result.append(best_c)
6
Generate candidates
Generate all possible similar words:
- Need to define a measure of similarity - we use Damerau-Levenshtein distance
- It allows insertions, deletions, substitutions and transpositions of symbols
- We limit maximum allowed distance depending on the length of the word
- Then just generate all edits of the 4 possible types (CPU-intensive)
- We will optimize this approach later
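The brute-force generation described above can be sketched roughly as follows. This is an illustration rather than the implementation from the talk: the function names `edits1` and `edits` and the `ALPHABET` constant are assumptions, and a real system would need a Vietnamese alphabet instead of plain a-z.

```python
# Hypothetical alphabet: an assumption for illustration; a real system would
# derive it from the characters seen in the corpus.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word, alphabet=ALPHABET):
    """Return every string exactly one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    substitutions = [l + ch + r[1:] for l, r in splits if r for ch in alphabet]
    inserts = [l + ch + r for l, r in splits for ch in alphabet]
    return set(deletes + transposes + substitutions + inserts) - {word}

def edits(word, max_distance, alphabet=ALPHABET):
    """Return all strings reachable from `word` in 1..max_distance edits."""
    frontier, seen = {word}, set()
    for _ in range(max_distance):
        # Expand every string found in the previous round by one more edit.
        frontier = {e for w in frontier for e in edits1(w, alphabet)} - seen
        seen |= frontier
    seen.discard(word)
    return seen
```

Even for a 6-letter word and a 26-letter alphabet, one round already yields several hundred strings and a second round yields tens of thousands, which is why this step is the CPU bottleneck.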
Examples of Damerau-Levenshtein distance:
- distance(nguyễn, nguyên) = 1 (one substitution)
- distance(nguyễn, nguyeenx) = 3 (one substitution, two insertions)
- distance(behaivour, behaviour) = 1 (one transposition)
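The three distances above can be checked with a short dynamic-programming routine. The sketch below computes the restricted variant of the Damerau-Levenshtein distance (often called optimal string alignment); the slides do not say which variant is used, so that choice and the function name are assumptions. Inputs are normalized to NFC so that each accented Vietnamese letter counts as one symbol.

```python
import unicodedata

def damerau_levenshtein(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    # Normalize so that accented letters are single code points.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    # d[i][j] holds the distance between a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]
```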
7
Spellchecker score
“Noisy channel” model:
- Bayesian formula: P(c|w) = P(w|c) * P(c) / P(w)
- Need to find candidate c which maximizes P(c|w)
- Can simplify to P(w|c) * P(c) because P(w) is constant for all candidates
Used probabilities:
- P(c|w) - probability of c being intended when w was observed
- P(w|c) - probability of the word w to be a misspelling of c - error model
- P(c) - probability to observe c - language model
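A minimal sketch of the noisy-channel selection, assuming the two models are available as lookup functions. The probabilities in `P_ERROR` and `P_LANG` are toy numbers invented for illustration, and all names are hypothetical.

```python
def best_candidate(observed, candidates, error_prob, language_prob):
    """Pick the candidate c that maximizes P(w|c) * P(c)."""
    best, best_score = observed, 0.0
    for c in candidates:
        # P(w) is the same for every candidate, so it is left out.
        score = error_prob(observed, c) * language_prob(c)
        if score > best_score:
            best, best_score = c, score
    return best, best_score

# Toy probabilities, invented purely for illustration.
P_ERROR = {("ipohne", "iphone"): 0.6, ("ipohne", "phone"): 0.1}
P_LANG = {"iphone": 0.02, "phone": 0.05}

best, score = best_candidate(
    "ipohne",
    ["iphone", "phone"],
    error_prob=lambda w, c: P_ERROR.get((w, c), 0.0),
    language_prob=lambda c: P_LANG.get(c, 0.0),
)
```

Here "phone" is the more frequent word, but the error model makes "iphone" the better explanation of the observed "ipohne".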
8
Building the language model
N-gram model:
- Building a 2-gram dictionary
- Remove 2-grams below a certain threshold
Used data:
- All product contents on Tiki
- All Tiki search queries for a year
- Some randomly crawled texts from the Vietnamese Web
- Total: 5.5 GB gzipped
9
Building the language model (example)
Data (queries on Tiki):
máy rửa mặt
máy rửa mắt
máy sấy tóc
máy xay tóc
máy xay sinh tố
máy rửa mắt
máy xay sinh tố
máy sấy tóc
...
máy sấy tóc
máy xay tóc
máy xay sinh tố
máy rửa mắt
máy rửa mắt
máy xay sinh tố
máy sấy tóc
Counted queries:
200 máy rửa mặt
5 máy rửa mắt
100 máy sấy tóc
5 máy xay tóc
100 máy xay sinh tố
10
Building the language model (example)
Counted queries:
200 máy rửa mặt
5 máy rửa mắt
100 máy sấy tóc
5 máy xay tóc
100 máy xay sinh tố
Language model:
410 <
410 >
410 máy
410 < máy
205 máy rửa
100 máy sấy
105 máy xay
105 tóc >
100 sấy tóc
5 xay tóc
105 tóc
...
We simply count every single word and every
word pair in the counted queries and write
the counts into the language model.
This lets us calculate the probability of
observing a word either without context or with
a context of 1 word before or after it.
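The counting step can be sketched as below, using the counted queries from the example. The function name and the `min_count` parameter are assumptions; the slides say only that 2-grams below a certain threshold are removed.

```python
from collections import Counter

def build_language_model(counted_queries, min_count=1):
    """Count unigrams and bigrams over queries wrapped in '<' and '>'."""
    unigrams, bigrams = Counter(), Counter()
    for count, query in counted_queries:
        tokens = ["<"] + query.split() + [">"]
        for token in tokens:
            unigrams[token] += count
        for left, right in zip(tokens, tokens[1:]):
            bigrams[(left, right)] += count
    # Drop rare 2-grams; the threshold value is an assumption.
    bigrams = Counter({k: v for k, v in bigrams.items() if v >= min_count})
    return unigrams, bigrams

COUNTED_QUERIES = [
    (200, "máy rửa mặt"),
    (5, "máy rửa mắt"),
    (100, "máy sấy tóc"),
    (5, "máy xay tóc"),
    (100, "máy xay sinh tố"),
]

unigrams, bigrams = build_language_model(COUNTED_QUERIES)
```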
11
Building the language model (example)
Language model:
410 <
410 >
410 máy
410 < máy
205 máy rửa
100 máy sấy
105 máy xay
105 tóc >
100 sấy tóc
5 xay tóc
105 tóc
...
Query: máy => “< máy >”
P(máy) = 0.5 * (P(< máy) + P(máy >))
= 0.5 * (410/410 + 0/410) = 0.5
Query: máy xay tóc
P(xay) = 0.5 * (P(máy xay) + P(xay tóc))
= 0.5 * (105/410 + 5/105) ~ 0.15
P(sấy) = 0.5 * (P(máy sấy) + P(sấy tóc))
= 0.5 * (100/410 + 100/105) ~ 0.60
Each pair count is divided by the count of the
context word (410 for “máy”, 105 for “tóc”).
The language model here suggests that the
probability of seeing “sấy” in this context is
higher than the probability of seeing “xay”.
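The same arithmetic can be reproduced in a few lines. This sketch hard-codes the counts from the example; the function name `context_probability` and the handling of unseen contexts (returning 0) are assumptions.

```python
UNIGRAMS = {"<": 410, ">": 410, "máy": 410, "tóc": 105}
BIGRAMS = {
    ("<", "máy"): 410, ("máy", "rửa"): 205, ("máy", "sấy"): 100,
    ("máy", "xay"): 105, ("tóc", ">"): 105, ("sấy", "tóc"): 100,
    ("xay", "tóc"): 5,
}

def context_probability(left, word, right):
    """Average of the left-context and right-context bigram estimates."""
    def ratio(pair, context):
        # Pair count divided by the count of the context word.
        total = UNIGRAMS.get(context, 0)
        return BIGRAMS.get(pair, 0) / total if total else 0.0
    return 0.5 * (ratio((left, word), left) + ratio((word, right), right))

p_may = context_probability("<", "máy", ">")
p_xay = context_probability("máy", "xay", "tóc")
p_say = context_probability("máy", "sấy", "tóc")
```

Evaluated this way, the formula gives 0.5 for “máy” on its own, about 0.15 for “xay” and about 0.60 for “sấy” between “máy” and “tóc”.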
12
Building the error model
Automatic extraction of P(w|c):
- Extract triplets (w1, w2, w3) from our text set
- Group triplets by (w1, *, w3) and sort by descending popularity
- Remove groupings below a certain threshold
- Remove samples whose w2 word is too far from the most popular w2 in the grouping (using Damerau-Levenshtein distance)
- Remove samples with popularity comparable to the most popular sample in the grouping
- Write the w2 words from all remaining samples into the error model mapping as triplets of (observed word, intended word, count)
Used data:
- Same as for the language model
13
Building the error model (example)
Data (queries on Tiki):
máy rửa mặt
máy rửa mắt
máy sấy tóc
máy xay tóc
máy xay sinh tố
máy rửa mắt
máy xay sinh tố
máy sấy tóc
...
máy sấy tóc
máy xay tóc
máy xay sinh tố
máy rửa mắt
máy rửa mắt
máy xay sinh tố
máy sấy tóc
Counted queries:
200 máy rửa mặt
5 máy rửa mắt
100 máy sấy tóc
5 máy xay tóc
100 máy xay sinh tố
14
Building the error model (example)
Counted queries:
200 máy rửa mặt
5 máy rửa mắt
100 máy sấy tóc
5 máy xay tóc
100 máy xay sinh tố
Triplets:
205 < máy rửa
200 rửa mặt >
5 rửa mắt >
100 máy sấy tóc
5 máy xay tóc
200 máy rửa mặt
5 máy rửa mắt
105 < máy xay
100 sinh tố >
...
We count all possible triplets from our counted
queries data.
15
Building the error model (example)
Triplets (grouped):
rửa * >
    200 rửa mặt >
    5 rửa mắt >
máy * tóc
    100 máy sấy tóc
    5 máy xay tóc
máy * sinh
    100 máy xay sinh
sinh * >
    100 sinh tố >
...
Error model:
200 mặt mặt
5 mắt mặt
100 sấy sấy
5 xay sấy
100 xay xay
100 tố tố
...
Format: count observed_word intended_word
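The extraction steps can be sketched as follows on the example data. This is one reading of the slides, not the original code: the most popular w2 in each group is treated as the intended word, and the parameters `min_group_count`, `max_distance` and `max_ratio` (with their default values) are assumptions standing in for the unspecified thresholds.

```python
import unicodedata
from collections import Counter, defaultdict

def osa_distance(a, b):
    """Restricted Damerau-Levenshtein distance (optimal string alignment)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]

def build_error_model(counted_queries, min_group_count=10,
                      max_distance=2, max_ratio=0.2):
    """Return a Counter mapping (observed, intended) to a count."""
    # 1. Count triplets (w1, w2, w3), grouped by their outer words (w1, w3).
    groups = defaultdict(Counter)
    for count, query in counted_queries:
        query = unicodedata.normalize("NFC", query)
        tokens = ["<"] + query.split() + [">"]
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            groups[(w1, w3)][w2] += count
    # 2. In each group, the most popular w2 is taken as the intended word.
    model = Counter()
    for middles in groups.values():
        if sum(middles.values()) < min_group_count:
            continue  # group is too rare to trust
        (intended, top_count), *rest = middles.most_common()
        model[(intended, intended)] += top_count
        for observed, count in rest:
            # Keep only close, clearly less popular variants as misspellings.
            if (osa_distance(observed, intended) <= max_distance
                    and count / top_count <= max_ratio):
                model[(observed, intended)] += count
    return model

COUNTED_QUERIES = [
    (200, "máy rửa mặt"), (5, "máy rửa mắt"), (100, "máy sấy tóc"),
    (5, "máy xay tóc"), (100, "máy xay sinh tố"),
]
error_model = build_error_model(COUNTED_QUERIES)
```

Note that “xay” and “sấy” differ in two symbols, so the distance limit has to be at least 2 for the (xay, sấy) entry from the example to survive.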
16
Building the error model (example)
Query: kem rửa mắt
P(mắt|mắt) = 0/5 = 0.0: the number of times
“mắt” was intended when “mắt” was observed in
the error model, divided by the total number of
times “mắt” was observed in the error model.
P(mắt|mặt) = 5/5 = 1.0: the number of times
“mặt” was intended when “mắt” was observed in
the error model, divided by the same total.
This means that, according to the error model
built on our data, “mắt” is extremely likely to
be a misspelling of “mặt”.
Error model:
200 mặt mặt
5 mắt mặt
100 sấy sấy
5 xay sấy
100 xay xay
100 tố tố
...
Format: count observed_word intended_word
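The division described on this slide can be written as a small lookup over the error model entries. The dictionary layout and the function name are assumptions made for this sketch.

```python
# (observed_word, intended_word) -> count, copied from the example above.
ERROR_MODEL = {
    ("mặt", "mặt"): 200, ("mắt", "mặt"): 5, ("sấy", "sấy"): 100,
    ("xay", "sấy"): 5, ("xay", "xay"): 100, ("tố", "tố"): 100,
}

def error_probability(observed, intended, model=ERROR_MODEL):
    """Share of the sightings of `observed` that were mapped to `intended`."""
    total = sum(n for (obs, _), n in model.items() if obs == observed)
    return model.get((observed, intended), 0) / total if total else 0.0
```

For example, `error_probability("xay", "sấy")` is 5/105, because “xay” was observed 105 times in total and mapped to “sấy” in 5 of them.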
17
Quality optimizations
Idea:
- Language model is more important in bigger context
- Instead of P(w|c)*P(c) use P(w|c)*pow(P(c),lambda)
- Lambda depends on the length of available context
Results:
- Using bigger lambda for longer context => better test result (idea works!)
- For bigger N-gram need to use machine learning to optimize lambdas
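A minimal sketch of the weighted score, computed in log space to avoid underflow. The lambda values in `LAMBDA_BY_CONTEXT` are placeholders invented for illustration; the slides state only that a bigger lambda is used for longer context.

```python
import math

# Hypothetical lambdas per number of available context words (0, 1 or 2).
LAMBDA_BY_CONTEXT = {0: 0.5, 1: 1.0, 2: 1.5}

def weighted_score(p_error, p_lang, context_words):
    """Return P(w|c) * P(c)**lambda for the given amount of context."""
    if p_error <= 0.0 or p_lang <= 0.0:
        return 0.0
    lam = LAMBDA_BY_CONTEXT[min(context_words, 2)]
    return math.exp(math.log(p_error) + lam * math.log(p_lang))
```

Because P(c) is at most 1, a larger lambda makes differences in the language model weigh more heavily relative to the error model.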
18
Performance optimizations
Important fact:
It can be proved that if the Damerau-Levenshtein distance(w, c) = N, then w and c can
always be reduced to the same string by deleting no more than N single characters from
each of them. Examples below:
distance(iphone, iphobee) = 2 (one insertion, one substitution)
iphone -> iphoe VS iphobee -> iphoee -> iphoe (match!)
distance(iphone, pihoone) = 2 (one transposition, one insertion)
iphone -> ihone VS pihoone -> ihoone -> ihone (match!)
Let’s use it to optimize candidates generation!
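The two examples can be verified with a small helper that enumerates delete variants. The function names are assumptions made for this sketch.

```python
def deletes(word, max_deletes):
    """Return `word` plus every string obtained by deleting up to
    `max_deletes` single characters from it."""
    results, frontier = {word}, {word}
    for _ in range(max_deletes):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results

def may_be_within(w, c, n):
    """True if w and c share a delete variant, which is a necessary
    condition for distance(w, c) <= n."""
    return not deletes(w, n).isdisjoint(deletes(c, n))
```

A shared variant is necessary but not sufficient, so a match still has to be confirmed with a real distance computation.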
19
Performance optimizations
Problem 1 - generating candidates is CPU-intensive:
- Precompute “deletes” dictionary
- Use only delete operations from both sides
- Need to double-check the distance (can be up to 2N, but we need N)
- Fast, but requires RAM
Problem 2 - having “deletes” dictionary requires RAM:
- Use different data compression techniques
- From what we’ve tried, Judy dynamic arrays work the best
- We decreased RAM requirements from 10.5 GB to 2.3 GB
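A rough sketch of the precomputed “deletes” dictionary with the distance double-check. It uses a plain Python dict of sets, not the Judy arrays mentioned above, and all function names are assumptions.

```python
from collections import defaultdict

def osa_distance(a, b):
    """Restricted Damerau-Levenshtein distance (optimal string alignment)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]

def deletes(word, max_deletes):
    """Return `word` plus all strings with up to `max_deletes` deletions."""
    results, frontier = {word}, {word}
    for _ in range(max_deletes):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results

def build_deletes_index(vocabulary, n):
    """Map every delete variant to the vocabulary words that produce it."""
    index = defaultdict(set)
    for word in vocabulary:
        for variant in deletes(word, n):
            index[variant].add(word)
    return index

def lookup(query, index, n):
    """Return vocabulary words within distance n of `query`."""
    found = set()
    for variant in deletes(query, n):
        found |= index.get(variant, set())
    # Sharing a variant allows a distance of up to 2n, so verify each hit.
    return {c for c in found if osa_distance(query, c) <= n}

index = build_deletes_index(["iphone", "ipad", "phone"], 2)
candidates = lookup("ipohne", index, 2)
```

The index trades memory for speed: every vocabulary word contributes all of its delete variants, which is what drives the RAM requirement discussed above.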
20
Testing results
Testing set:
- 5,000 random queries, 10,000 misspelled queries
- Suggestions collected through Google API and then manually checked
- Only one marker per query
Results:
- Slightly (10-12%) worse than Google (acceptable given the RAM requirements)
- An A/B test showed a 3-9% increase in purchases
21
Future plans
Implementation:
- Use 3-gram data (still trying to keep it RAM-optimal)
Testing:
- Use multi-marker test set
- Properly handle cases when spellchecker returns multiple variants
Thank you!
22

ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 

Grokking TechTalk #35: Efficient spellchecking

  • 1. CPU- and memory-efficient spellchecker implementation at Tiki
  • 3. Results for “ipohne” without spellchecker
  • 4. Results for “ipohne” with spellchecker
  • 5. General approach
      # Correct the query word by word
      words, result = (tokenize(query), [])
      for w in words:
          candidates = generate_candidates(w)
          # Keep the candidate with the highest spellchecker score
          best_c, best_score = (None, 0.)
          for c in candidates:
              score = spellchecker_score(w, c)
              if score > best_score:
                  best_c, best_score = (c, score)
          result.append(best_c)
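A runnable version of the loop above might look like the following sketch. The whitespace tokenizer, the fallback to the observed word when no candidate scores above zero, and the toy candidate and score tables are assumptions made for illustration; they are not taken from the talk.

```python
from typing import Callable, Iterable, List


def correct_query(
    query: str,
    generate_candidates: Callable[[str], Iterable[str]],
    spellchecker_score: Callable[[str, str], float],
) -> List[str]:
    """Pick the best-scoring candidate for every word of the query."""
    result = []
    for w in query.split():  # assumption: whitespace tokenization
        # Assumption: keep the observed word if nothing scores above zero.
        best_c, best_score = w, 0.0
        for c in generate_candidates(w):
            score = spellchecker_score(w, c)
            if score > best_score:
                best_c, best_score = c, score
        result.append(best_c)
    return result


# Toy stand-ins for the real candidate generator and scorer (made up).
TOY_CANDIDATES = {"ipohne": ["iphone", "phone"]}
TOY_SCORES = {("ipohne", "iphone"): 0.9, ("ipohne", "phone"): 0.2}

corrected = correct_query(
    "ipohne case",
    lambda w: TOY_CANDIDATES.get(w, []),
    lambda w, c: TOY_SCORES.get((w, c), 0.0),
)
print(corrected)  # ['iphone', 'case']
```

Passing the generator and the scorer in as functions keeps the loop independent of how candidates and probabilities are actually produced.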
  • 6. Generate candidates
      Generate all possible similar words:
      - We need a measure of similarity; we use the Damerau-Levenshtein distance
      - It allows insertions, deletions, substitutions and transpositions of symbols
      - We limit the maximum allowed distance depending on the length of the word
      - Then we simply generate all edits of the 4 possible types (CPU-intensive)
      - We will optimize this approach later
      Examples of Damerau-Levenshtein distance:
      - distance(nguyễn, nguyên) = 1 (one substitution)
      - distance(nguyễn, nguyeenx) = 3 (one substitution, two insertions)
      - distance(behaivour, behaviour) = 1 (one transposition)
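The distance and the brute-force edit generation described above can be sketched as follows. This is an illustration, not the implementation from the talk: it uses the restricted "optimal string alignment" variant of Damerau-Levenshtein, normalizes input to Unicode NFC so that a Vietnamese letter with diacritics counts as one symbol (an assumption), and takes the alphabet as a hypothetical parameter.

```python
import unicodedata


def osa_distance(a: str, b: str) -> int:
    """Damerau-Levenshtein distance (optimal string alignment variant)."""
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    # d[i][j] is the distance between a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]


def edits1(word: str, alphabet: str) -> set:
    """All strings one edit away from word (may include word itself)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    substitutions = [l + ch + r[1:] for l, r in splits if r for ch in alphabet]
    inserts = [l + ch + r for l, r in splits for ch in alphabet]
    return set(deletes + transposes + substitutions + inserts)


print(osa_distance("nguyễn", "nguyên"))        # 1
print(osa_distance("nguyễn", "nguyeenx"))      # 3
print(osa_distance("behaivour", "behaviour"))  # 1
print("behaviour" in edits1("behaivour", "abcdefghijklmnopqrstuvwxyz"))  # True
```

The number of generated edits grows with both the word length and the alphabet size, and applying `edits1` repeatedly for larger distances multiplies it again, which is why this step is costly.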
  • 7. Spellchecker score
      “Noisy channel” model:
      - Bayes' formula: P(c|w) = P(w|c) * P(c) / P(w)
      - We need to find the candidate c that maximizes P(c|w)
      - This simplifies to maximizing P(w|c) * P(c), because P(w) is the same for all candidates
      Probabilities used:
      - P(c|w): probability that c was intended when w was observed
      - P(w|c): probability that the word w is a misspelling of c (error model)
      - P(c): probability of observing c (language model)
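A small sketch of why P(w) can be dropped: dividing every score by the same positive constant does not change the ranking of candidates. All probabilities below are made-up numbers for the observed word "ipohne", and `rank` is a hypothetical helper.

```python
def rank(candidates, p_w_given_c, p_c, p_w=None):
    """Sort candidates by P(w|c) * P(c), optionally divided by P(w)."""
    def score(c):
        s = p_w_given_c[c] * p_c[c]
        return s / p_w if p_w else s
    return sorted(candidates, key=score, reverse=True)


# Made-up error-model and language-model probabilities.
P_W_GIVEN_C = {"iphone": 0.6, "phone": 0.1, "ipohne": 0.3}
P_C = {"iphone": 0.05, "phone": 0.08, "ipohne": 0.0001}

candidates = list(P_C)
with_evidence = rank(candidates, P_W_GIVEN_C, P_C, p_w=0.01)
without_evidence = rank(candidates, P_W_GIVEN_C, P_C)

print(without_evidence)                   # ['iphone', 'phone', 'ipohne']
print(with_evidence == without_evidence)  # True
```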
  • 8. Building the language model
      N-gram model:
      - Build a 2-gram dictionary
      - Remove 2-grams whose count is below a certain threshold
      Data used:
      - All product content on Tiki
      - All Tiki search queries for a year
      - Some randomly crawled texts from the Vietnamese Web
      - Total: 5.5 GB gzipped
  • 9. Building the language model (example)
      Data (queries on Tiki):
      máy rửa mặt
      máy rửa mắt
      máy sấy tóc
      máy xay tóc
      máy xay sinh tố
      máy rửa mắt
      máy xay sinh tố
      máy sấy tóc
      ...
      máy sấy tóc
      máy xay tóc
      máy xay sinh tố
      máy rửa mắt
      máy rửa mắt
      máy xay sinh tố
      máy sấy tóc
      Counted queries:
      200  máy rửa mặt
        5  máy rửa mắt
      100  máy sấy tóc
        5  máy xay tóc
      100  máy xay sinh tố
  • 10. Building the language model (example)
      Counted queries:
      200  máy rửa mặt
        5  máy rửa mắt
      100  máy sấy tóc
        5  máy xay tóc
      100  máy xay sinh tố
      Language model (“<” and “>” mark the start and the end of a query):
      410  <
      410  >
      410  máy
      410  < máy
      205  máy rửa
      100  máy sấy
      105  máy xay
      105  tóc >
      100  sấy tóc
        5  xay tóc
      105  tóc
      ...
      We simply count every single word and every pair of adjacent words in the counted queries and write the counts into the language model. This lets us calculate the probability of observing a word either without context or in the context of one word before or after it.
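The counting step can be sketched in a few lines. The function name `build_language_model` and the whitespace tokenization are assumptions; the input is the counted-queries table from the slide, and the resulting counts match the language model listed above.

```python
from collections import Counter

COUNTED_QUERIES = {
    "máy rửa mặt": 200,
    "máy rửa mắt": 5,
    "máy sấy tóc": 100,
    "máy xay tóc": 5,
    "máy xay sinh tố": 100,
}


def build_language_model(counted_queries):
    """Count single tokens and adjacent token pairs, weighted by query count."""
    unigrams, bigrams = Counter(), Counter()
    for query, n in counted_queries.items():
        tokens = ["<"] + query.split() + [">"]  # add start and end markers
        for token in tokens:
            unigrams[token] += n
        for pair in zip(tokens, tokens[1:]):
            bigrams[pair] += n
    return unigrams, bigrams


unigrams, bigrams = build_language_model(COUNTED_QUERIES)
print(unigrams["máy"], unigrams["tóc"])  # 410 105
print(bigrams[("máy", "rửa")], bigrams[("xay", "tóc")])  # 205 5
```

A production version would also drop pairs below a count threshold, as the previous slide mentions; that filter is omitted here to keep the sketch short.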
  • 11. Building the language model (example)
      Language model:
      410  <
      410  >
      410  máy
      410  < máy
      205  máy rửa
      100  máy sấy
      105  máy xay
      105  tóc >
      100  sấy tóc
        5  xay tóc
      105  tóc
      ...
      Each term below divides the count of a word pair by the count of the neighboring token, e.g. P(máy xay) = count(máy xay) / count(máy).
      Query: máy => “< máy >”
      P(máy) = 0.5 * (P(< máy) + P(máy >)) = 0.5 * (410/410 + 0/410) = 0.5
      Query: máy xay tóc
      P(xay) = 0.5 * (P(máy xay) + P(xay tóc)) = 0.5 * (105/410 + 5/105) ~ 0.15
      P(sấy) = 0.5 * (P(máy sấy) + P(sấy tóc)) = 0.5 * (100/410 + 100/105) ~ 0.60
      The language model suggests that “sấy” is more likely than “xay” in this context.
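The formula on this slide can be evaluated directly from the listed counts. In this sketch, `context_probability` is a hypothetical name, and the two dictionaries contain only the counts shown on the slide.

```python
# Counts copied from the language model on the slide.
UNIGRAMS = {"<": 410, ">": 410, "máy": 410, "tóc": 105}
BIGRAMS = {
    ("<", "máy"): 410,
    ("máy", "rửa"): 205,
    ("máy", "sấy"): 100,
    ("máy", "xay"): 105,
    ("tóc", ">"): 105,
    ("sấy", "tóc"): 100,
    ("xay", "tóc"): 5,
}


def context_probability(word, left, right):
    """Average of the pair frequencies relative to each neighbor's count."""
    p_left = BIGRAMS.get((left, word), 0) / UNIGRAMS[left]
    p_right = BIGRAMS.get((word, right), 0) / UNIGRAMS[right]
    return 0.5 * (p_left + p_right)


p_may = context_probability("máy", "<", ">")
p_xay = context_probability("xay", "máy", "tóc")
p_say = context_probability("sấy", "máy", "tóc")
print(round(p_may, 2), round(p_xay, 2), round(p_say, 2))  # 0.5 0.15 0.6
```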
  • 12. Building the error model
      Automatic extraction of P(w|c):
      - Extract triplets (w1, w2, w3) from our text set
      - Group triplets by (w1, *, w3) and sort by descending popularity
      - Remove groupings below a certain threshold
      - Remove samples where the w2 words are too far from each other (using Damerau-Levenshtein distance)
      - Remove samples with popularity comparable to the most popular sample in the grouping
      - Write the w2 words from all remaining samples into the error model mapping as triplets of (observed word, intended word, count)
      Data used:
      - Same as for the language model
  • 13. Building the error model (example)
      Data (queries on Tiki):
      máy rửa mặt
      máy rửa mắt
      máy sấy tóc
      máy xay tóc
      máy xay sinh tố
      máy rửa mắt
      máy xay sinh tố
      máy sấy tóc
      ...
      máy sấy tóc
      máy xay tóc
      máy xay sinh tố
      máy rửa mắt
      máy rửa mắt
      máy xay sinh tố
      máy sấy tóc
      Counted queries:
      200  máy rửa mặt
        5  máy rửa mắt
      100  máy sấy tóc
        5  máy xay tóc
      100  máy xay sinh tố
  • 14. Building the error model (example)
      Counted queries:
      200  máy rửa mặt
        5  máy rửa mắt
      100  máy sấy tóc
        5  máy xay tóc
      100  máy xay sinh tố
      Triplets:
      205  < máy rửa
      200  rửa mặt >
        5  rửa mắt >
      100  máy sấy tóc
        5  máy xay tóc
      200  máy rửa mặt
        5  máy rửa mắt
      105  < máy xay
      100  sinh tố >
      ...
      We count all possible triplets from our counted queries data.
  • 15. Building the error model (example)
      Triplets (grouped):
      rửa * >
          200  rửa mặt >
            5  rửa mắt >
      máy * tóc
          100  máy sấy tóc
            5  máy xay tóc
      máy * sinh
          100  máy xay sinh
      sinh * >
          100  sinh tố >
      ...
      Error model (format: count observed_word intended_word):
      200  mặt  mặt
        5  mắt  mặt
      100  sấy  sấy
        5  xay  sấy
      100  xay  xay
      100  tố   tố
      ...
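The grouping procedure can be sketched as below. The thresholds (`min_group_count`, `max_distance`, `max_ratio`) are hypothetical parameters with values chosen only so that the toy data reproduces the entries on the slide, and the distance is the optimal string alignment variant of Damerau-Levenshtein on NFC-normalized text (an assumption).

```python
import unicodedata
from collections import Counter, defaultdict

COUNTED_QUERIES = {
    "máy rửa mặt": 200, "máy rửa mắt": 5, "máy sấy tóc": 100,
    "máy xay tóc": 5, "máy xay sinh tố": 100,
}


def osa_distance(a, b):
    """Damerau-Levenshtein distance (optimal string alignment variant)."""
    a, b = (unicodedata.normalize("NFC", s) for s in (a, b))
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


def build_error_model(counted_queries, min_group_count=1,
                      max_distance=2, max_ratio=0.5):
    """Return a Counter mapping (observed, intended) to a count."""
    groups = defaultdict(Counter)  # (w1, w3) -> counts of the middle word w2
    for query, n in counted_queries.items():
        tokens = ["<"] + query.split() + [">"]
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            groups[(w1, w3)][w2] += n
    model = Counter()
    for middles in groups.values():
        if sum(middles.values()) < min_group_count:
            continue  # grouping is too rare to trust
        intended, top = middles.most_common(1)[0]
        model[(intended, intended)] += top
        for observed, n in middles.items():
            if observed == intended:
                continue
            if osa_distance(observed, intended) > max_distance:
                continue  # too different to be a misspelling
            if n > max_ratio * top:
                continue  # comparably popular, probably a valid word
            model[(observed, intended)] += n
    return model


model = build_error_model(COUNTED_QUERIES)
print(model[("mắt", "mặt")], model[("xay", "sấy")])  # 5 5
```

Because this sketch keeps every grouping, it also produces entries that the slide elides behind "...", such as one for "máy" observed as itself.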
  • 16. Building the error model (example)
      Error model (format: count observed_word intended_word):
      200  mặt  mặt
        5  mắt  mặt
      100  sấy  sấy
        5  xay  sấy
      100  xay  xay
      100  tố   tố
      ...
      Query: kem rửa mắt
      P(mắt|mắt) = 0/5 = 0.0: we divide the number of times “mắt” was intended when “mắt” was observed in the error model by the total number of times “mắt” was observed in the error model.
      P(mắt|mặt) = 5/5 = 1.0: again, we divide the number of times “mặt” was intended when “mắt” was observed in the error model by the total number of times “mắt” was observed in the error model.
      This means that, according to the error model built on our data, “mắt” is extremely likely to be a misspelling of “mặt”.
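The lookup described on this slide amounts to one division per candidate. The sketch below hard-codes the error model entries shown on the slide; `error_probability` is a hypothetical name.

```python
# (observed_word, intended_word) -> count, copied from the slide.
ERROR_MODEL = {
    ("mặt", "mặt"): 200,
    ("mắt", "mặt"): 5,
    ("sấy", "sấy"): 100,
    ("xay", "sấy"): 5,
    ("xay", "xay"): 100,
    ("tố", "tố"): 100,
}


def error_probability(observed, intended, model=ERROR_MODEL):
    """Share of the observations of `observed` that meant `intended`."""
    total = sum(n for (obs, _), n in model.items() if obs == observed)
    if total == 0:
        return 0.0  # the word never appears as an observation
    return model.get((observed, intended), 0) / total


print(error_probability("mắt", "mắt"))  # 0.0
print(error_probability("mắt", "mặt"))  # 1.0
print(round(error_probability("xay", "sấy"), 3))  # 0.048
```

With the same table, “xay” is treated as intended in 100 of its 105 observations, so it is only rarely corrected to “sấy” on the strength of the error model alone.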
  • 17. Quality optimizations
      Idea:
      - The language model matters more when more context is available
      - Instead of P(w|c) * P(c), use P(w|c) * pow(P(c), lambda)
      - Lambda depends on the length of the available context
      Results:
      - Using a bigger lambda for longer context gives better test results (the idea works!)
      - For bigger N-grams, machine learning is needed to optimize the lambdas
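A toy illustration of the exponent's effect follows. The lambda values and the two candidates are invented for this sketch; the talk only states that lambda grows with the length of the available context.

```python
# Hypothetical exponents indexed by the number of context words available.
LAMBDA_BY_CONTEXT = {0: 0.5, 1: 1.0, 2: 1.5}


def weighted_score(p_w_given_c, p_c, context_len):
    """P(w|c) * P(c) ** lambda, with lambda chosen by context length."""
    return p_w_given_c * pow(p_c, LAMBDA_BY_CONTEXT[context_len])


# Made-up candidates given as (P(w|c), P(c)).
A = (0.9, 0.1)  # favored by the error model
B = (0.3, 0.6)  # favored by the language model

winners = {}
for n in (0, 1, 2):
    winners[n] = "A" if weighted_score(*A, n) > weighted_score(*B, n) else "B"
print(winners)  # {0: 'A', 1: 'B', 2: 'B'}
```

With little context, the small exponent flattens the language model and the error model decides; with more context, the language model dominates.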
  • 18. Performance optimizations
      Important fact: it can be proven that if the Damerau-Levenshtein distance(w, c) = N, then we can always delete no more than N single characters from each of w and c so that both become the same string.
      Examples:
      - distance(iphone, iphobee) = 2 (one insertion, one substitution)
        iphone -> iphoe VS iphobee -> iphoee -> iphoe (match!)
      - distance(iphone, pihoone) = 2 (one transposition, one insertion)
        iphone -> ihone VS pihoone -> ihoone -> ihone (match!)
      Let’s use this to optimize candidate generation!
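The two examples can be checked mechanically by enumerating every string reachable with up to N deletions and intersecting the two sets. `deletes` is a hypothetical helper name.

```python
def deletes(word, max_deletes):
    """All strings obtained by removing up to max_deletes characters."""
    seen = {word}
    frontier = {word}
    for _ in range(max_deletes):
        # Remove one more character from every string found so far.
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        seen |= frontier
    return seen


common_1 = deletes("iphone", 2) & deletes("iphobee", 2)
common_2 = deletes("iphone", 2) & deletes("pihoone", 2)
print("iphoe" in common_1)  # True
print("ihone" in common_2)  # True
```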
  • 19. Performance optimizations
      Problem 1: generating candidates is CPU-intensive:
      - Precompute a “deletes” dictionary
      - Use only delete operations on both sides
      - The distance must be re-checked (a match on deletes allows a distance of up to 2N, but we need at most N)
      - Fast, but requires RAM
      Problem 2: the “deletes” dictionary requires RAM:
      - Use different data compression techniques
      - Of what we have tried, Judy dynamic arrays work best
      - We decreased RAM requirements from 10.5 GB to 2.3 GB
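A minimal in-memory sketch of the “deletes” dictionary and the distance re-check is shown below. It uses plain Python dicts and sets rather than the compressed structures mentioned above, and the names `build_index` and `lookup` and the three-word vocabulary are assumptions made for illustration.

```python
from collections import defaultdict


def deletes(word, max_deletes):
    """All strings obtained by removing up to max_deletes characters."""
    seen, frontier = {word}, {word}
    for _ in range(max_deletes):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        seen |= frontier
    return seen


def osa_distance(a, b):
    """Damerau-Levenshtein distance (optimal string alignment variant)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


def build_index(vocabulary, max_distance):
    """Precompute: deleted form -> dictionary words that produce it."""
    index = defaultdict(set)
    for word in vocabulary:
        for key in deletes(word, max_distance):
            index[key].add(word)
    return index


def lookup(word, index, max_distance):
    """Collect words sharing a deleted form, then verify the real distance."""
    found = set()
    for key in deletes(word, max_distance):
        found |= index.get(key, set())
    return {c for c in found if osa_distance(word, c) <= max_distance}


index = build_index(["iphone", "phone", "hones"], max_distance=1)
print(lookup("ipohne", index, 1))          # {'iphone'}
print(sorted(lookup("phone", index, 1)))   # ['iphone', 'phone']
```

In the second lookup, “hones” shares the deleted form “hone” with “phone” but is two edits away, so the final distance check filters it out. This is the 2N case mentioned above.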
  • 20. Testing results
      Testing set:
      - 5,000 random queries, 10,000 misspelled queries
      - Suggestions collected through Google API and then manually checked
      - Only one marker per query
      Results:
      - Slightly (10-12%) worse than Google (acceptable for such RAM requirements)
      - The A/B test shows a 3-9% increase in purchases
  • 21. Future plans
      Implementation:
      - Use 3-gram data (while still trying to keep it RAM-optimal)
      Testing:
      - Use a multi-marker test set
      - Properly handle cases when the spellchecker returns multiple variants