Real-time Data De-duplication
using Locality-sensitive Hashing
powered by Storm and Riak
Dr. Stefan Schadwinkel
@ Berlin Buzzwords 2014
Dr. Stefan Schadwinkel
Co-Founder / Analytics Engineer
stefan.schadwinkel@deck36.de
– DECK36 is an ICANS spin-off
– Small team of core engineers
– Longstanding expertise in complex web systems
– Developing our own data-intelligence-focused tools and web services
– Offering our expert knowledge in:
	– Automation & Operations
	– Architecture & Engineering
	– Analytics & Data Logistics
DUPLICATE DATA
Why is that interesting?
Databases
Fraud
Fingerprinting
Sequence Tracking
Clustering in general
Velocity
GENERAL CLUSTERING
The Mighty and Terrible …
… Distance Matrix.
DUPLICATE DATA
Some comments…
Record linkage (RL)
… refers to the task of finding records in a data set that refer to the same entity across
different data sources (e.g., data files, books, websites, databases).
… joining data sets based on entities that may or may not share a common identifier
… as may be the case due to differences in record shape, storage location, and/or
curator style or preference.
… records do not share a common key […] Errors are introduced as the result of
transcription errors, incomplete information, lack of standard formats, or any
combination of these factors.
DUPLICATE DATA
Short recap.
Elmagarmid et al. (2007): “Duplicate Record Detection: A Survey”
• The probabilistic foundations: Dunn (1946) and Newcombe et al. (1959).
• Fellegi and Sunter (1969) formalized these ideas into what remains the mathematical foundation for many record linkage applications even today.
• Various machine learning techniques since the late 1990s.

Basic Method
• For each feature select a match-type (character-based, token-based, phonetic, ..)
• For each feature, define a weight (importance)
• Calculate total match value (sum of weight * match result)
• Use machine learning to learn the weights
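
A minimal sketch of the basic method, assuming two hypothetical features ("name", "address") with hand-picked weights. The slides suggest learning the weights; here they are simply fixed for illustration, and the match functions are stand-ins chosen from the Python standard library.

```python
import difflib

def char_match(a, b):
    """Character-based match: similarity ratio in [0, 1]."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def token_match(a, b):
    """Token-based match: Jaccard overlap of whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

# Hypothetical configuration: feature -> (match-type, weight).
FEATURES = {
    "name": (char_match, 0.6),
    "address": (token_match, 0.4),
}

def total_match(rec_a, rec_b, features=FEATURES):
    """Total match value = sum of weight * match result over all features."""
    return sum(
        weight * match(rec_a[name], rec_b[name])
        for name, (match, weight) in features.items()
    )

rec_a = {"name": "Jon Smith", "address": "12 Main Street"}
rec_b = {"name": "John Smith", "address": "Main Street 12"}
print(total_match(rec_a, rec_b))
```

A pair would then be flagged as a duplicate when the total exceeds some threshold, which is itself something to tune.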
DUPLICATE DATA
Grrr, that terrible daemon!
Captain Obvious: It’s prohibitively expensive.
• Single step divide and conquer called “Blocking”
• Group data into “blocks” and perform the comparisons only within the blocks
• Multi-pass approaches
It’s artisan craftwork.
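
A rough sketch of single-pass blocking. The blocking key used here (first letter of a hypothetical "name" field) is only an assumption for illustration; picking a good key is exactly the craftwork part.

```python
from collections import defaultdict
from itertools import combinations

def block_key(record):
    """Hypothetical blocking key: first letter of the name, lowercased."""
    return record["name"][:1].lower()

def candidate_pairs(records, key=block_key):
    """Group records into blocks and yield pairs only within each block."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key(record)].append(record)
    for members in blocks.values():
        yield from combinations(members, 2)

records = [{"name": n} for n in ["Anna", "anne", "Bob", "Bert", "Carl"]]
pairs = list(candidate_pairs(records))
print(len(pairs), "comparisons instead of", len(records) * (len(records) - 1) // 2)
```

Records that land in different blocks are never compared, so a bad key silently loses duplicates. That is why multi-pass approaches with several keys exist.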
DISCUSSION
Meet Dedoop
Dedoop: It’s an environment for Deduplication Craftsmanship
• It’s a research project from Leipzig University
• http://dbs.uni-leipzig.de/research/projects/large_scale_object_matching
Research Focus Topics
• Skew handling / Load balancing
• Because, you know, some blocks are bigger than others …
• Redundancy-free comparisons
• Multi-pass approaches lead to overlapping blocks
• Overlapping blocks lead to comparisons you already made…
MULTI-INDEX LSH
The Algorithm
MULTI-INDEX LSH
Overview
Not my invention.
The method is from the Eventbrite Engineering Blog.
Kudos to Jay Chan! And all the other guys.

https://engineering.eventbrite.com/multi-index-locality-sensitive-hashing-for-fun-and-profit/

Main points:
1. The entity representation is transformed using the MinHash algorithm.
2. The number of comparisons is reduced using an indexing scheme.
MULTI-INDEX LSH
Tokenization
Free Text

A: “The world is for you to get over with!”
➔ the, world, is, for, you, to, get, over, with

B: “Let’s take over the world!”
➔ lets, take, over, the, world

Jaccard Similarity Coefficient
Union Set of Tokens: {the, world, is, for, you, to, get, over, with, lets, take}
Intersection of Token Sets: {the, world, over}
JSC = size of intersection / size of union = 3 / 11 ≈ 27%
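
The example above can be reproduced with a few lines. The tokenizer is an assumption (lowercase, drop apostrophes, split on anything that is not a letter or digit); the slides do not say which tokenizer was actually used.

```python
import re

def tokenize(text):
    """Lowercase, drop apostrophes, split on non-alphanumeric characters."""
    cleaned = re.sub(r"[’']", "", text.lower())
    return re.findall(r"[a-z0-9]+", cleaned)

def jaccard(tokens_a, tokens_b):
    """Jaccard similarity coefficient: |intersection| / |union|."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

tokens_a = tokenize("The world is for you to get over with!")
tokens_b = tokenize("Let’s take over the world!")
print(tokens_a)
print(tokens_b)
print(jaccard(tokens_a, tokens_b))  # 3 / 11
```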
MULTI-INDEX LSH
Jaccard Similarity by MinHash
Just count how many hashes match.
Approx. Jaccard = # of matching hashes / # hashes

A: Bag of 32 hashes
B: Bag of 32 hashes

Let’s say 10 hashes are identical;
then the Jaccard Similarity is approximately 10 / 32,
which is ~ 0.31
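
As a small sketch of the counting step: this assumes each bag is an ordered list where position i holds the minimum under hash function i, so the bags are compared position by position. The sample values are made up.

```python
def estimate_jaccard(sig_a, sig_b):
    """Approx. Jaccard = number of matching hashes / number of hashes."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

# Made-up signatures: the first 10 of 32 positions agree.
sig_a = list(range(32))
sig_b = list(range(10)) + [h + 1000 for h in range(10, 32)]
print(estimate_jaccard(sig_a, sig_b))  # 10 / 32 = 0.3125
```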
MULTI-INDEX LSH
MinHash
“hash each token. take the minimum. repeat N times.”
Transforms a bag of tokens to a bag of N hash values (32bit integers)
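
One possible way to get N different hash functions is to salt a single hash with the function number. The sketch below does that with BLAKE2b from the standard library, truncated to 32 bits. The choice of BLAKE2b and the salting scheme are assumptions; the slides do not name the hash family.

```python
import hashlib

def token_hash(token, seed):
    """32-bit hash of a token; `seed` selects one of the N hash functions."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=4,                      # 4 bytes = 32 bits
        salt=seed.to_bytes(8, "little"),    # different salt per hash function
    ).digest()
    return int.from_bytes(digest, "big")

def minhash(tokens, n_hashes=32):
    """Hash each token, take the minimum, repeat N times."""
    unique = set(tokens)  # must be non-empty, otherwise min() raises ValueError
    return [min(token_hash(t, seed) for t in unique) for seed in range(n_hashes)]

signature = minhash(["the", "world", "is", "for", "you"])
print(len(signature), signature[:4])
```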
MULTI-INDEX LSH
Bit Sampling
Minhashing so far reduced the price per comparison, but we would still need to
compare all hash bags with all hash bags. But first, let’s reduce the price further.

1. We don’t need to know the exact hash; we only want to know whether two hashes match.
2. If one bit does not match, we know the two hashes can’t be identical.

➔ Just keep the least significant bit (LSB). 32 hash values become one single 32-bit number.

Of course the LSB will be identical in 50% of the cases at random, but we can adjust for that.
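
A sketch of bit sampling and of one way to adjust for random agreement. The correction is my assumption about what "adjust" means here: if a fraction J of the hashes truly match and the remaining LSBs agree half of the time, the observed bit agreement is p = J + (1 - J) / 2, so J ≈ 2p - 1.

```python
def bit_sample(signature):
    """Keep only the least significant bit of each hash, packed into one integer."""
    bits = 0
    for position, value in enumerate(signature):
        bits |= (value & 1) << position
    return bits

def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def corrected_similarity(sample_a, sample_b, n_bits=32):
    """Estimate Jaccard from bit agreement, removing the 50% chance agreement."""
    p = 1 - hamming(sample_a, sample_b) / n_bits
    return max(0.0, 2 * p - 1)

print(bit_sample([1, 0, 1, 1]))                    # bits 0, 2, 3 set -> 13
print(corrected_similarity(0x000000FF, 0x0, 32))   # 8 bits differ -> p = 0.75 -> 0.5
```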
MULTI-INDEX LSH
Multi-Index “Blocking”
Norouzi et al. (2012): “Fast Search in Hamming Space with Multi-Index Hashing”
http://www.cs.toronto.edu/~norouzi/research/mih/
https://github.com/norouzi/mih

Based on the “Pigeonhole Principle”
If I put 10 items into 9 pigeonholes…
	 - then there must be at least one that has >= 2 items.
	 - then there must be at least one that has 0 or 1 items.

If I put N items into M containers…
	 - then there must be at least one that has >= ceil(N/M) items [ceil(10/9) = 2]
	 - then there must be at least one that has <= floor(N/M) items [floor(10/9) = 1]
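
For the skeptical, both bounds can be checked by brute force on small cases. This sketch enumerates every way to put N items into M containers, so it is only meant for tiny N and M.

```python
from itertools import product
from math import ceil, floor

def pigeonhole_holds(n_items, m_containers):
    """Check both bounds for every possible assignment of items to containers."""
    for assignment in product(range(m_containers), repeat=n_items):
        counts = [assignment.count(c) for c in range(m_containers)]
        if max(counts) < ceil(n_items / m_containers):
            return False  # no container is "full enough"
        if min(counts) > floor(n_items / m_containers):
            return False  # no container is "empty enough"
    return True

print(pigeonhole_holds(5, 3))  # 3**5 = 243 assignments
print(pigeonhole_holds(4, 4))  # e.g. 4 unequal bits spread over 4 chunks
```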
MULTI-INDEX LSH
Multi-Index “Blocking”
We can use that principle to build an index for our hash bitsamples.
32 hash functions ➔ One 32 bit sample
90% Match ➔ 28 bits must match ➔ up to 4 can be unequal (➔ N)
Split the 32 bit sample in 4 chunks of 8 bit ➔ at least one chunk has 0 or 1 unequal bit

These chunks will now become our index. Imagine a map:
1st chunk ➔ [full 32 bit, …]
2nd chunk ➔ [full 32 bit, …]
3rd chunk ➔ [full 32 bit, …]
4th chunk ➔ [full 32 bit, …]

As you see, we trade space for time…
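
A minimal in-memory sketch of such an index, assuming one map per chunk position and the lowest 8 bits as the 1st chunk. Both are illustration choices, not something the slides specify.

```python
from collections import defaultdict

N_BITS = 32
K_CHUNKS = 4
CHUNK_BITS = N_BITS // K_CHUNKS          # 8 bits per chunk
CHUNK_MASK = (1 << CHUNK_BITS) - 1

def split_chunks(sample):
    """Split a 32-bit sample into 4 chunks of 8 bits, lowest bits first."""
    return [(sample >> (i * CHUNK_BITS)) & CHUNK_MASK for i in range(K_CHUNKS)]

def build_index(samples):
    """One map per chunk position: chunk value -> list of full bit samples."""
    index = [defaultdict(list) for _ in range(K_CHUNKS)]
    for sample in samples:
        for position, chunk in enumerate(split_chunks(sample)):
            index[position][chunk].append(sample)
    return index

index = build_index([0x12345678, 0x12FF5678])
print([hex(c) for c in split_chunks(0x12345678)])
print([hex(s) for s in index[0][0x78]])
```

Every sample is stored K times, once per chunk position. That is the space being traded for time.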
MULTI-INDEX LSH
Multi-Index “Blocking”
Our candidates are now those in the list behind the matching chunk.
Split the 32 bit sample in 4 chunks of 8 bit ➔ at least one chunk has 0 or 1 unequal bit
➔ All relevant candidates are behind those chunks

Lookup:
– Split bit sample into chunks
– Take all candidates behind one chunk and its bit-distance siblings (chunk values that differ from it in a single bit)
– You can stop once you have candidates based on one source chunk
– With all the candidates you then perform the final comparisons
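
A sketch of the lookup under the same assumptions as before (32-bit samples, 4 chunks of 8 bits, in-memory maps). Unlike the slide, it does not stop early but collects candidates from every chunk position, which is a conservative choice on my side. "Siblings" are taken to be chunk values at Hamming distance 1.

```python
from collections import defaultdict

N_BITS, K_CHUNKS = 32, 4
CHUNK_BITS = N_BITS // K_CHUNKS
CHUNK_MASK = (1 << CHUNK_BITS) - 1

def split_chunks(sample):
    return [(sample >> (i * CHUNK_BITS)) & CHUNK_MASK for i in range(K_CHUNKS)]

def hamming(a, b):
    return bin(a ^ b).count("1")

def siblings(chunk):
    """The chunk itself plus every value that differs from it in exactly one bit."""
    yield chunk
    for bit in range(CHUNK_BITS):
        yield chunk ^ (1 << bit)

def insert(index, sample):
    for position, chunk in enumerate(split_chunks(sample)):
        index[position][chunk].append(sample)

def lookup(index, query, max_distance=4):
    """Collect candidates via the chunk maps, then do the final comparison."""
    candidates = set()
    for position, chunk in enumerate(split_chunks(query)):
        for near in siblings(chunk):
            candidates.update(index[position].get(near, ()))
    return sorted(c for c in candidates if hamming(c, query) <= max_distance)

index = [defaultdict(list) for _ in range(K_CHUNKS)]
base = 0x12345678
near = base ^ (1 << 0) ^ (1 << 9) ^ (1 << 17) ^ (1 << 25)  # one flipped bit per chunk
far = base ^ 0xFFFFFFFF                                     # every bit flipped
for sample in (base, near, far):
    insert(index, sample)
print([hex(s) for s in lookup(index, base)])
```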
MULTI-INDEX LSH
Multi-Index “Blocking”
Performance boundaries
M - total number of messages to compare
N - number of hash functions resp. bits in the sample
K - number of chunks the bit sample is split into

Insert
O(1) - compute the K chunks and store in map

Lookup
O(1) - compute the K chunks and lookup the candidate set
O(M * M / 2^(N/K)) - compare every message (M) with all candidates (M / 2^(N/K))
If we choose N and K so that 2^(N/K) >> M, we can now achieve O(M), i.e. linear scaling.
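
The estimate can be turned into a tiny calculator. It assumes chunk values are spread evenly over all 2^(N/K) possible values, which real data will only approximate; the sample numbers are arbitrary.

```python
def expected_candidates(m, n_bits, k_chunks):
    """Expected bucket size per chunk lookup: M / 2^(N/K)."""
    return m / 2 ** (n_bits / k_chunks)

def expected_comparisons(m, n_bits, k_chunks):
    """Every message (M) is compared with its candidates (M / 2^(N/K))."""
    return m * expected_candidates(m, n_bits, k_chunks)

# 1024 messages, 32-bit samples, 4 chunks of 8 bits -> 256 possible chunk values
print(expected_candidates(1024, 32, 4))   # 1024 / 256 = 4.0
print(expected_comparisons(1024, 32, 4))  # 1024 * 4 = 4096.0
```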
HANDS ON!
Storm and Riak.
HANDS ON!
Mail Grouping for Spam Detection
I wanted something else, but: Available data!
http://untroubled.org/spam/
“This directory contains all the spam that I have received since early 1998. I have
employed various "bait" addresses, … to trick email address harvesters into putting them
on spam lists.”
Spam mails from 2014
January + February: 85500 Mails

2^(N/K) >> M
2^24 ~ 16 million messages
K - 8 chunks of 24 bits each
N - 192 hash functions
[Diagram: Storm topology writing to Riak buckets]
Storm: SpamEmailSpout ➔ Tokenizer ➔ AddToLshTable ➔ RiakLshIndexWriter
Riak Buckets: lsh_APP_GROUP_0 … lsh_APP_GROUP_7
KEY: CHUNK0:CHUNK1:…|RELATION_ID
	e.g. 10000386:6442152:… |/Users/…/spam/2014/01/1388703495.txt
INDEX: CHUNK0_VALUE == 10000386
	HASH == CHUNK0:CHUNK1:…:CHUNK7
	RELATION_ID == /Users/…/spam/2014/01/1388703495.txt
VALUE: {“time_of_update” : 1395768557828, …}
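
To make the bucket layout more concrete, here is a plain-Python sketch that builds one entry per chunk for a 192-bit sample. It does not talk to Riak at all. The function name, the dict layout, the bit order of the chunks, and the per-bucket index name CHUNK<i>_VALUE are assumptions derived from the labels in the diagram.

```python
N_BITS = 192                       # hash functions / bits per sample
K_CHUNKS = 8                       # chunks per sample
CHUNK_BITS = N_BITS // K_CHUNKS    # 24 bits per chunk

def split_chunks(sample):
    """Split a 192-bit sample into 8 chunks of 24 bits (lowest bits = CHUNK0)."""
    mask = (1 << CHUNK_BITS) - 1
    return [(sample >> (i * CHUNK_BITS)) & mask for i in range(K_CHUNKS)]

def make_entries(sample, relation_id, now_ms):
    """Hypothetical helper: one entry per chunk, destined for bucket lsh_APP_GROUP_<i>."""
    chunks = split_chunks(sample)
    hash_str = ":".join(str(c) for c in chunks)
    return [
        {
            "bucket": f"lsh_APP_GROUP_{position}",
            "key": f"{hash_str}|{relation_id}",
            "index": {
                f"CHUNK{position}_VALUE": chunk,
                "HASH": hash_str,
                "RELATION_ID": relation_id,
            },
            "value": {"time_of_update": now_ms},
        }
        for position, chunk in enumerate(chunks)
    ]

sample = 10000386 | (6442152 << 24)   # only the first two chunks set, for brevity
entries = make_entries(sample, "mail-1.txt", 1395768557828)
print(entries[0]["bucket"], entries[0]["key"])
```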
DEMO
THE FUTURE
Some ideas.
THE FUTURE
Applications
Fingerprinting
Create tokens from data available with JavaScript and HTTP Request:

A: {"browser":"Mozilla Firefox","plugins":"Flash,Silverlight","zipcode":20121}
➔ browser_mozilla, browser_firefox, plugin_flash, plugin_silverlight, zipcode_20121

B: {"browser":"Google Chrome","plugins":"Flash,Java","likes":[123,124]}
➔ browser_google, browser_chrome, plugin_flash, plugin_java, like_123, like_124

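
One way to get from such a record to the tokens shown above. The rules here (strip a trailing "s" from the key, split string values on commas and whitespace) are guessed from the two examples and are not a general-purpose scheme.

```python
import re

def fingerprint_tokens(record):
    """Turn a flat fingerprint record into prefixed tokens."""
    tokens = []
    for key, value in record.items():
        # "plugins" -> "plugin", "likes" -> "like"; other keys stay as they are
        prefix = key[:-1] if key.endswith("s") else key
        values = value if isinstance(value, list) else re.split(r"[,\s]+", str(value))
        for item in values:
            item = str(item).strip().lower()
            if item:
                tokens.append(f"{prefix}_{item}")
    return tokens

record_a = {"browser": "Mozilla Firefox", "plugins": "Flash,Silverlight", "zipcode": 20121}
record_b = {"browser": "Google Chrome", "plugins": "Flash,Java", "likes": [123, 124]}
print(fingerprint_tokens(record_a))
print(fingerprint_tokens(record_b))
```

The resulting token lists can then go through MinHash like any free-text tokens.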
[Diagram: JS-Client in Browser, WS/SSE, API Server, Q: Broker, Storm Cluster + DB]
THE FUTURE
Applications
Sequence Tracking
Sequence Code: step1_$URL, step2_$URL, …
Set Code: $URL1, $URL2, …
• Gets you the most common ‘fuzzy’ sequences or sets
• Variation: use a window for X successively visited sites, products, etc.
• Add features, e.g. sets of bought items
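
A sketch of the two encodings plus the window variation. The function names and sample URLs are made up.

```python
def sequence_tokens(urls):
    """Sequence code: the position is part of the token, so order matters."""
    return [f"step{i}_{url}" for i, url in enumerate(urls, start=1)]

def set_tokens(urls):
    """Set code: only which URLs were visited, not in which order."""
    return sorted(set(urls))

def windows(urls, size):
    """Sliding windows over `size` successively visited URLs."""
    return [urls[i:i + size] for i in range(len(urls) - size + 1)]

visit = ["/home", "/product/1", "/cart", "/home"]
print(sequence_tokens(visit))
print(set_tokens(visit))
print([sequence_tokens(w) for w in windows(visit, 2)])
```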
THE FUTURE
Applications
Fraud
Often identities are generated following some pattern, with variations:

A: a_lastname32@yahoo.com
B: alex_lastname746@yahoo.co.uk

• Generally, encode redundantly using multiple encodings
• Remove whitespace, split & use q-grams
• Use phonetic encoding schemes
• Use special knowledge about email, address, etc. to create tags
• Use pre-filters and artisan blocking (e.g. major free mail providers)
– Low number of tokens
– Needs some research
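
A small sketch of q-gram tokens with redundant encoding for the two example addresses. The token prefixes (local_, domain_, provider_) and q = 3 are arbitrary choices for illustration; phonetic encodings are left out.

```python
def qgrams(text, q=3):
    """Remove whitespace, lowercase, and return all substrings of length q."""
    compact = "".join(text.split()).lower()
    if len(compact) < q:
        return [compact] if compact else []
    return [compact[i:i + q] for i in range(len(compact) - q + 1)]

def email_tokens(address, q=3):
    """Redundant encoding: q-grams of the local part plus domain and provider tags."""
    local, _, domain = address.lower().partition("@")
    tokens = [f"local_{gram}" for gram in qgrams(local, q)]
    tokens.append(f"domain_{domain}")
    tokens.append(f"provider_{domain.split('.')[0]}")
    return tokens

tokens_a = set(email_tokens("a_lastname32@yahoo.com"))
tokens_b = set(email_tokens("alex_lastname746@yahoo.co.uk"))
print(sorted(tokens_a & tokens_b))
```

With so few tokens per identity the MinHash estimate gets noisy, which matches the caveat above about the low number of tokens.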
Thank You.

Advanced Test Driven-Development @ php[tek] 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 

Real-time Data De-duplication using Locality-sensitive Hashing powered by Storm and Riak (Berlin Buzzwords 2014)

  • 1. Real-time Data De-duplication using Locality-sensitive Hashing powered by Storm and Riak. Dr. Stefan Schadwinkel @ Berlin Buzzwords 2014
  • 2. Dr. Stefan Schadwinkel, Co-Founder / Analytics Engineer, stefan.schadwinkel@deck36.de
      – DECK36 is an ICANS Spin-Off
      – Small team of core engineers
      – Longstanding expertise in complex web systems
      – Developing own data intelligence-focussed tools and web services
      – Offering our expert knowledge in:
        – Automation & Operations
        – Architecture & Engineering
        – Analytics & Data Logistics
  • 3. DUPLICATE DATA Why is that interesting?
  • 5. GENERAL CLUSTERING The Mighty and Terrible … … Distance Matrix.
  • 6.
  • 7. DUPLICATE DATA: Some comments… Record linkage (RL)
      … refers to the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, databases).
      … joining data sets based on entities that may or may not share a common identifier … as may be the case due to differences in record shape, storage location, and/or curator style or preference.
      … records do not share a common key […] Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats, or any combination of these factors.
  • 8. DUPLICATE DATA: Short recap.
      Elmagarmid et al. (2007): “Duplicate Record Detection: A Survey”
      • The probabilistic foundations: Dunn (1946) and Newcombe et al. (1959).
      • Fellegi and Sunter (1969) formalized the probabilistic foundations into what remains the mathematical foundation for many record linkage applications even today.
      • Various machine learning techniques since the late 1990s.
      Basic Method
      • For each feature, select a match type (character-based, token-based, phonetic, ..)
      • For each feature, define a weight (importance)
      • Calculate the total match value (sum of weight * match result)
      • Use machine learning to learn the weights
  • 9. DUPLICATE DATA: Grrr, that terrible daemon!
      Captain Obvious: It’s prohibitively expensive.
      • Single-step divide and conquer called “Blocking”
      • Group data into “blocks” and perform the comparisons only within the blocks
      • Multi-pass approaches
      It’s artisan craftwork.
  • 10. DISCUSSION: Meet Dedoop
      Dedoop: It’s an environment for Deduplication Craftsmanship
      • It’s a research project from Leipzig University
      • http://dbs.uni-leipzig.de/research/projects/large_scale_object_matching
      Research Focus Topics
      • Skew handling / Load balancing
        – Because, you know, some blocks are bigger than others …
      • Redundancy-free comparisons
        – Multi-pass approaches lead to overlapping blocks
        – Overlapping blocks lead to comparisons you already made…
  • 13. MULTI-INDEX LSH: Overview
      Not my invention. The method is from the Eventbrite Engineering Blog. Kudos to Jay Chan! And all the other guys.
      https://engineering.eventbrite.com/multi-index-locality-sensitive-hashing-for-fun-and-profit/
      Main points:
      1. The entity representation is transformed using the MinHash algorithm.
      2. The number of comparisons is reduced using an indexing scheme.
  • 14. MULTI-INDEX LSH: Tokenization
      Free Text
      A: “The world is for you to get over with!” ➔ the, world, is, for, you, to, get, over, with
      B: “Let’s take over the world!” ➔ lets, take, over, the, world
      Jaccard Similarity Coefficient
      Union set of tokens: {the, world, is, for, you, to, get, over, with, lets, take}
      Intersection of token sets: {the, world, over}
      JSC = size of intersection / size of union = 3 / 11 ≈ 27%
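The Jaccard calculation on this slide can be sketched in a few lines of Python. The helper names `tokenize` and `jaccard` are hypothetical, and the tokenization rule (lowercase, drop apostrophes, split on anything that is not a letter or digit) is an assumption inferred from the token lists shown on the slide, not the speaker's actual tokenizer.

```python
import re


def tokenize(text):
    """Turn free text into a set of lowercase tokens (assumed rule, see lead-in)."""
    # Drop straight and typographic apostrophes so "Let's" becomes "lets".
    cleaned = text.lower().replace("'", "").replace("\u2019", "")
    return set(re.findall(r"[a-z0-9]+", cleaned))


def jaccard(tokens_a, tokens_b):
    """Jaccard similarity coefficient: |intersection| / |union|."""
    union = tokens_a | tokens_b
    if not union:
        return 1.0  # Convention chosen here: two empty sets count as identical.
    return len(tokens_a & tokens_b) / len(union)


a = tokenize("The world is for you to get over with!")
b = tokenize("Let's take over the world!")
print(sorted(a & b))            # ['over', 'the', 'world']
print(round(jaccard(a, b), 2))  # 0.27
```

Computing this exactly for every pair of messages is what the following slides try to avoid.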
  • 15. MULTI-INDEX LSH: Jaccard Similarity by MinHash
      Just count how many hashes match.
      Approx. Jaccard = # of matching hashes / # of hashes
      A: Bag of 32 hashes
      B: Bag of 32 hashes
      Let’s say 10 hashes are identical; then the Jaccard similarity is approximately 10 / 32, which is ~ 0.31.
  • 16. MULTI-INDEX LSH: MinHash
      “hash each token. take the minimum. repeat N times.”
      Transforms a bag of tokens into a bag of N hash values (32-bit integers)
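A minimal MinHash sketch following the recipe "hash each token, take the minimum, repeat N times". The talk does not say which hash family was used, so the seeded BLAKE2b hash truncated to 32 bits is an assumption made here for illustration, and `hash32`, `minhash` and `estimate_jaccard` are hypothetical names.

```python
import hashlib


def hash32(token, seed):
    """Seeded 32-bit hash of a token (assumed hash family: salted BLAKE2b)."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=4,                    # 4 bytes = one 32-bit integer
        salt=seed.to_bytes(8, "little"),  # the seed selects the hash function
    ).digest()
    return int.from_bytes(digest, "big")


def minhash(tokens, num_hashes=32):
    """Bag of num_hashes minima, one per seeded hash function.

    tokens must be non-empty, otherwise min() raises ValueError.
    """
    return [min(hash32(t, seed) for t in tokens) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Approximate Jaccard: number of matching positions / number of hashes."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


sig_a = minhash({"the", "world", "is", "for", "you", "to", "get", "over", "with"})
sig_b = minhash({"lets", "take", "over", "the", "world"})
print(estimate_jaccard(sig_a, sig_b))
```

With only 32 hash functions the estimate is coarse (steps of 1/32); more hash functions tighten it at the cost of a larger signature.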
  • 17. MULTI-INDEX LSH: Bit Sampling
      MinHashing so far reduced the price per comparison, but we would still need to compare all hash bags with all hash bags. But first, let’s reduce the price further.
      1. We don’t need to know the exact hash, we only want to know if they match.
      2. If one bit does not match, we know the two hashes can’t be identical.
      ➔ Just keep the least significant bit. 32 hash values become one single 32-bit number.
      Of course the LSB will be identical in 50% of the cases at random, but we can adjust for that.
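The bit-sampling step can be sketched as follows. The function names are hypothetical. The slide only says that the 50% chance agreement "can be adjusted for"; the correction used below (expected matching fraction is (1 + J) / 2, so J is roughly 2 * fraction - 1) is one common way to do that under the assumption that the least significant bits of non-matching hashes behave like fair coin flips, and it is not necessarily the adjustment used in the talk.

```python
def bit_sample(signature):
    """Pack the least significant bit of each hash value into one integer."""
    sample = 0
    for h in signature:
        sample = (sample << 1) | (h & 1)
    return sample


def hamming(a, b):
    """Number of bit positions in which two bit samples differ."""
    return bin(a ^ b).count("1")


def estimate_jaccard_from_bits(a, b, num_bits=32):
    """Jaccard estimate corrected for random LSB agreement (assumed correction)."""
    match_fraction = 1 - hamming(a, b) / num_bits
    return max(0.0, 2 * match_fraction - 1)


print(bin(bit_sample([1, 0, 1, 1])))  # 0b1011
print(hamming(0b1011, 0b0001))        # 2
```

Comparing two messages is now a single XOR plus a bit count instead of 32 integer comparisons.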
  • 18. MULTI-INDEX LSH: Multi-Index “Blocking”
      Norouzi et al. (2012): “Fast Search in Hamming Space with Multi-Index Hashing”
      http://www.cs.toronto.edu/~norouzi/research/mih/
      https://github.com/norouzi/mih
      Based on the “Pigeonhole Principle”
      If I put 10 items into 9 pigeonholes…
      – then there must be at least one that has >= 2 items.
      – then there must be at least one that has 0 or 1 item.
      If I put N items into M containers…
      – then there must be at least one that has >= ceil(N/M) items [ceil(10/9) = 2]
      – then there must be at least one that has <= floor(N/M) items [floor(10/9) = 1]
  • 19. MULTI-INDEX LSH: Multi-Index “Blocking”
      We can use that principle to build an index for our hash bit samples.
      • 32 hash functions ➔ one 32-bit sample
      • 90% match ➔ 28 bits must match ➔ up to 4 can be unequal (➔ N)
      • Split the 32-bit sample into 4 chunks of 8 bits ➔ at least one chunk has 0 or 1 unequal bit
      These chunks will now become our index. Imagine a map:
        1st chunk ➔ [full 32 bit, …]
        2nd chunk ➔ [full 32 bit, …]
        3rd chunk ➔ [full 32 bit, …]
        4th chunk ➔ [full 32 bit, …]
      As you see, we trade space for time…
  • 20. MULTI-INDEX LSH: Multi-Index “Blocking”
      Our candidates are now those in the list behind the matching chunk.
      Split the 32-bit sample into 4 chunks of 8 bits ➔ at least one chunk has 0 or 1 unequal bit ➔ all relevant candidates are behind those chunks
      Lookup:
      – Split the bit sample into chunks
      – Take all candidates behind one chunk and its bit distance siblings
      – You can stop once you have candidates based on one source chunk
      – With all the candidates you then perform the final comparisons
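The index and lookup described on slides 19 and 20 can be sketched as an in-memory structure. This is an illustrative sketch, not the Storm/Riak implementation from the talk: `MultiIndex`, `split_chunks` and `neighbours` are hypothetical names, plain dictionaries stand in for the storage layer, and the lookup probes all four chunk tables instead of stopping after the first one that yields candidates, which is a deliberately conservative choice.

```python
from collections import defaultdict

NUM_BITS = 32
NUM_CHUNKS = 4
CHUNK_BITS = NUM_BITS // NUM_CHUNKS  # 8 bits per chunk


def split_chunks(sample):
    """Split a 32-bit sample into 4 chunks of 8 bits, lowest bits first."""
    mask = (1 << CHUNK_BITS) - 1
    return [(sample >> (i * CHUNK_BITS)) & mask for i in range(NUM_CHUNKS)]


def neighbours(chunk):
    """The chunk itself plus every chunk value at bit distance 1."""
    yield chunk
    for bit in range(CHUNK_BITS):
        yield chunk ^ (1 << bit)


class MultiIndex:
    def __init__(self):
        # One map per chunk position: chunk value -> set of full bit samples.
        self.tables = [defaultdict(set) for _ in range(NUM_CHUNKS)]

    def insert(self, sample):
        for table, chunk in zip(self.tables, split_chunks(sample)):
            table[chunk].add(sample)

    def lookup(self, sample, max_distance=4):
        candidates = set()
        for table, chunk in zip(self.tables, split_chunks(sample)):
            for probe in neighbours(chunk):
                candidates |= table.get(probe, set())
        # Final comparison: keep only candidates within the Hamming radius.
        return {c for c in candidates if bin(c ^ sample).count("1") <= max_distance}


index = MultiIndex()
for s in (0x00000000, 0x0000000F, 0x01010101, 0xFFFFFFFF):
    index.insert(s)
print(sorted(index.lookup(0x00000000)))
```

With at most 4 differing bits spread over 4 chunks, at least one chunk differs in at most 1 bit, so probing each chunk and its 8 one-bit siblings cannot miss a sample within the radius.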
  • 21. MULTI-INDEX LSH: Multi-Index “Blocking”, performance boundaries
      M: total number of messages to compare
      N: number of hash functions, i.e. bits in the sample
      K: number of chunks the bit sample is split into
      Insert
      – O(1): compute the K chunks and store them in the map
      Lookup
      – O(1): compute the K chunks and look up the candidate set
      – O(M * M / 2^(N/K)): compare every message (M) with all candidates (M / 2^(N/K))
      If we choose N and K so that 2^(N/K) >> M, we can achieve O(M), i.e. linear scaling.
  • 23. HANDS ON! Mail Grouping for Spam Detection
      I wanted something else, but: Available data!
      http://untroubled.org/spam/
      “This directory contains all the spam that I have received since early 1998. I have employed various "bait" addresses, … to trick email address harvesters into putting them on spam lists.”
      Spam mails from January and February 2014: 85500 mails
      Parameters, chosen so that 2^(N/K) >> M:
      – 2^24 ~ 16 million messages
      – K = 8 chunks of 24 bits each
      – N = 192 hash functions
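The parameter choice for the spam data set can be checked with a little arithmetic. The helper names are hypothetical, and the estimate assumes chunk values are spread uniformly over all 2^(N/K) possibilities, which real mail data will only approximate.

```python
def chunk_bits(num_hashes, num_chunks):
    """Bits per chunk when N hash bits are split evenly into K chunks."""
    if num_hashes % num_chunks:
        raise ValueError("N must be divisible by K")
    return num_hashes // num_chunks


def expected_candidates(num_messages, num_hashes, num_chunks):
    """Expected messages sharing one chunk value: M / 2^(N/K) (uniformity assumed)."""
    return num_messages / 2 ** chunk_bits(num_hashes, num_chunks)


M, N, K = 85500, 192, 8
print(chunk_bits(N, K))              # 24
print(2 ** chunk_bits(N, K))         # 16777216
print(expected_candidates(M, N, K))  # well below 1
```

Because 2^24 is far larger than the 85500 mails, a random chunk value is expected to hold far fewer than one unrelated message, so almost all candidates returned by a lookup are genuine near-duplicates.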
  • 24. [Architecture diagram]
      Storm: SpamEmailSpout, Tokenizer, AddToLshTable, RiakLshIndexWriter
      Riak Buckets: lsh_APP_GROUP_0 … lsh_APP_GROUP_7
      Record layout:
        KEY: CHUNK0:CHUNK1:…|RELATION_ID, e.g. 10000386:6442152:…|/Users/…/spam/2014/01/1388703495.txt
        INDEX: CHUNK0_VALUE, e.g. 10000386
        VALUE: {“time_of_update” : 1395768557828, …}
      where HASH == CHUNK0:CHUNK1:…:CHUNK7 and RELATION_ID == /Users/…/spam/2014/01/1388703495.txt
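The record layout on this slide can be illustrated with a small helper that only builds the bucket name, key and index value for each chunk. It does not talk to Riak or Storm. The function name `riak_entries` is hypothetical, and the reading that chunk i is written to bucket `lsh_APP_GROUP_i` with the chunk value as its index is an assumption drawn from the diagram labels.

```python
def riak_entries(chunks, relation_id, bucket_prefix="lsh_APP_GROUP_"):
    """Build one (bucket, key, index) record per chunk, following the slide's layout.

    Assumed layout: KEY is CHUNK0:CHUNK1:...|RELATION_ID and the index of
    the record in bucket i is the value of chunk i.
    """
    key = ":".join(str(c) for c in chunks) + "|" + relation_id
    return [
        {"bucket": f"{bucket_prefix}{i}", "key": key, "index": chunk}
        for i, chunk in enumerate(chunks)
    ]


# Two made-up chunk values and a made-up relation id, for illustration only.
for entry in riak_entries([10000386, 6442152], "example-mail.txt"):
    print(entry)
```

A lookup would then query the index of the bucket that belongs to a chunk position and read the full hash and relation id back out of the key.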
  • 25. DEMO
  • 27. THE FUTURE: Applications, Fingerprinting
      Create tokens from data available with JavaScript and HTTP Request:
      A: {“browser”:“Mozilla Firefox”,“plugins”:“Flash,Silverlight”,“zipcode”:20121}
         ➔ browser_mozilla, browser_firefox, plugin_flash, plugin_silverlight, zipcode_20121
      B: {“browser”:“Google Chrome”,“plugins”:“Flash,Java”,“likes”:[123,124]}
         ➔ browser_google, browser_chrome, plugin_flash, plugin_java, like_123, like_124
      [Diagram: JS-Client in Browser, API Server, Q: Broker, Storm Cluster + DB, WS/SSE]
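The fingerprint tokenization shown on this slide can be sketched as below. The function name `fingerprint_tokens` is hypothetical, and the rules are assumptions inferred from the two examples: values are split on commas and whitespace, lowercased, and prefixed with the field name, with a trailing "s" stripped from the field name as a deliberately naive singularization.

```python
import re


def fingerprint_tokens(record):
    """Flatten a fingerprint record into prefixed tokens (assumed rules, see lead-in)."""
    tokens = []
    for key, value in record.items():
        prefix = key[:-1] if key.endswith("s") else key  # plugins -> plugin
        if isinstance(value, list):
            values = [str(v) for v in value]
        else:
            values = re.split(r"[,\s]+", str(value))
        tokens.extend(f"{prefix}_{v.lower()}" for v in values if v)
    return tokens


a = {"browser": "Mozilla Firefox", "plugins": "Flash,Silverlight", "zipcode": 20121}
b = {"browser": "Google Chrome", "plugins": "Flash,Java", "likes": [123, 124]}
print(fingerprint_tokens(a))
print(fingerprint_tokens(b))
```

The resulting token lists can be fed into the same MinHash and bit-sampling pipeline as the mail tokens.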
  • 28. THE FUTURE: Applications, Sequence Tracking
      Sequence Code: step1_$URL, step2_$URL, …
      Set Code: $URL1, $URL2, …
      • Gets you the most common ‘fuzzy’ sequences or sets
      • Variation: use a window for X successively visited sites, products, etc.
      • Add features, e.g. sets of bought items
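The two encodings and the window variation can be sketched as follows. All three function names are hypothetical, and the example URLs are made up for illustration.

```python
def sequence_tokens(urls):
    """Sequence code: the position is part of the token, so order matters."""
    return [f"step{i}_{url}" for i, url in enumerate(urls, start=1)]


def set_tokens(urls):
    """Set code: only which URLs were visited matters, not their order."""
    return set(urls)


def sliding_windows(urls, size):
    """Window variation: every run of `size` successively visited URLs."""
    return [urls[i:i + size] for i in range(len(urls) - size + 1)]


visit = ["/home", "/product/1", "/cart"]
print(sequence_tokens(visit))
print(sliding_windows(visit, 2))
```

Each window can be tokenized with either encoding and then de-duplicated like any other token bag, which is what surfaces the most common fuzzy sequences.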
  • 29. THE FUTURE: Applications, Fraud
      Often identities are generated after some pattern with variations:
      A: a_lastname32@yahoo.com
      B: alex_lastname746@yahoo.co.uk
      • Generally, encode redundantly using multiple encodings
      • Remove whitespace, split & use q-grams
      • Use phonetic encoding schemes
      • Use special knowledge about email, address, etc. to create tags
      • Use pre-filters and artisan blocking (i.e. major free mail providers)
        – Low number of tokens
        – Needs some research
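The "remove whitespace, split and use q-grams" step can be sketched for the two example addresses. The names `qgrams` and `local_part` are hypothetical, and q = 3 is an arbitrary choice made for this illustration; the talk does not fix a value.

```python
def qgrams(text, q=3):
    """Set of overlapping character q-grams after removing whitespace."""
    compact = "".join(text.lower().split())
    if len(compact) < q:
        return {compact} if compact else set()
    return {compact[i:i + q] for i in range(len(compact) - q + 1)}


def local_part(email):
    """Everything before the @ sign."""
    return email.split("@")[0]


grams_a = qgrams(local_part("a_lastname32@yahoo.com"))
grams_b = qgrams(local_part("alex_lastname746@yahoo.co.uk"))
similarity = len(grams_a & grams_b) / len(grams_a | grams_b)
print(sorted(grams_a & grams_b))
print(similarity)
```

Short identifiers yield only a handful of q-grams, which matches the slide's caveat about the low number of tokens.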