Fashion product de-duplication with image similarity and LSH


Eddie Bell

My talk from PyData London, 12/2014


Matching Fashion Products
With LSH Image Similarity

We collect the world of
fashion into a customisable
shopping experience.

What makes us different?
All data is scraped from retailers
500 spiders (scrapy), 9000 designers
Almost everything is automated
SEO, recommendation, classification, sales
This architecture comes with a few problems

Why do we get duplicates?
There is no ISBN for fashion
[Figure: example listings from Burberry and Selfridges, labelled inter-retailer (across the two) and intra-retailer (within each)]

How We Used to Find Duplicates
Lucene fuzzy string matching
Doesn’t really work
Yoox.com
3,000 products called “dress”
7,000 products called “shirt”

How We Detect Duplicates Now
BRISK image descriptors
Leutenegger, Chli and Siegwart
BRISK: Binary Robust Invariant Scalable Keypoints.
ICCV 2011: 2548-2555

How We Detect Duplicates Now
BRISK octaves

From Descriptors to Image Similarity
k-means
bag of words
1 x k
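A minimal sketch of the step this slide names, turning one image's descriptors into a 1 x k bag-of-words vector. It assumes the k centroids have already been learned with k-means and that descriptors are plain numeric vectors. The function name `bag_of_words` and the toy data are hypothetical, not taken from the talk.

```python
from math import dist


def bag_of_words(descriptors, centroids):
    """Return a 1 x k histogram counting the descriptors nearest each centroid."""
    histogram = [0] * len(centroids)
    for descriptor in descriptors:
        # Index of the closest "visual word" (centroid) by Euclidean distance.
        nearest = min(range(len(centroids)),
                      key=lambda i: dist(descriptor, centroids[i]))
        histogram[nearest] += 1
    return histogram


# Toy example: k = 3 centroids in 2 dimensions, 4 descriptors from one image.
centroids = [(0, 0), (10, 10), (0, 10)]
descriptors = [(1, 1), (9, 9), (0, 1), (1, 9)]
print(bag_of_words(descriptors, centroids))  # [2, 1, 1]
```

Every image ends up with a vector of the same length k, however many keypoints it has, so two images can be compared directly.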

Architecture
Started in Storm/Java
Very painful
Ended up in Celery
Much nicer
Matching is done in Elasticsearch.

Speed
We apply some filters, but there is still a lot of data
Could be doing 1,000,000+ comparisons per image
Speed it up with locality-sensitive hashing

Locality-Sensitive Hashing
[Figure, built up over four slides: each item is labelled with a binary hash code that grows by one bit per slide, from 1 bit (e.g. 1, 0) to 4 bits (e.g. 1011, 0101, 0000).]

LSH (Random Projections)
Map the d-dimensional space into an n-dimensional binary code.
0001 1110 1110 0010
1000 0010 1101 1001
0010 1100 0011 1110
1100 1100 1100 1010
1111 0001 0111 0101
0101 0111 0110 0101
Split the n-dimensional code into q r-bit integers.
0001 1110 1110 0010
0001 = 1
1110 = 14
1110 = 14
0010 = 2
Use the integer codes (i.e. the hashes) to filter for similarity.
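The steps on this slide can be sketched as follows, using random hyperplanes as the projection (one bit per hyperplane). This is an illustration under stated assumptions, not the production code from the talk. The helper names are hypothetical, and the filter rule in `is_candidate` (two items are candidates if any r-bit integer matches at the same position) is an assumed, common choice that the slide does not spell out.

```python
import random


def make_hyperplanes(d, n, seed=0):
    """Draw n random Gaussian hyperplanes in d dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


def binary_code(vector, hyperplanes):
    """Map a d-dimensional vector to n bits: 1 if it lies on a plane's positive side."""
    return [1 if sum(w * x for w, x in zip(plane, vector)) >= 0 else 0
            for plane in hyperplanes]


def split_code(bits, r):
    """Split an n-bit code into q = n / r integers of r bits each."""
    if len(bits) % r:
        raise ValueError("code length must be a multiple of r")
    return [int("".join(map(str, bits[i:i + r])), 2)
            for i in range(0, len(bits), r)]


def is_candidate(hashes_a, hashes_b):
    """Assumed filter: keep a pair if any r-bit integer agrees position-wise."""
    return any(a == b for a, b in zip(hashes_a, hashes_b))


# The 16-bit code from the slide, split into q = 4 integers of r = 4 bits.
slide_code = [0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0]
print(split_code(slide_code, 4))  # [1, 14, 14, 2]
```

Only pairs that pass the integer filter need a full similarity comparison, which is where the saving over 1,000,000+ comparisons per image would come from.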

Parameters
The Johnson-Lindenstrauss lemma states that
for an n×m matrix (n points in m dimensions) we can project into…
d = O(log n)
… dimensions and still approximately preserve pairwise distances.
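A small numerical check of this claim, assuming a Gaussian random projection scaled by 1/sqrt(d). The lemma is usually stated with an error tolerance ε, as d = O(log n / ε²). The sizes below (20 points, 200 dimensions, d = 100) are arbitrary toy values, not the parameters used in the talk, and the function names are hypothetical.

```python
import math
import random


def random_projection(points, d, seed=0):
    """Project points from m dimensions down to d with a Gaussian random matrix."""
    rng = random.Random(seed)
    m = len(points[0])
    scale = 1.0 / math.sqrt(d)  # keeps expected squared lengths unchanged
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(m)] for _ in range(d)]
    return [[sum(w * x for w, x in zip(row, p)) for row in matrix] for p in points]


def distance_ratios(original, projected):
    """Projected distance divided by original distance, for every pair of points."""
    ratios = []
    for i in range(len(original)):
        for j in range(i + 1, len(original)):
            ratios.append(math.dist(projected[i], projected[j])
                          / math.dist(original[i], original[j]))
    return ratios


rng = random.Random(1)
points = [[rng.random() for _ in range(200)] for _ in range(20)]
ratios = distance_ratios(points, random_projection(points, d=100))
print(min(ratios), max(ratios))  # both should sit near 1.0
```

A ratio close to 1 for every pair means the projection kept the distances roughly intact; raising d tightens the spread.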

What’s Next
Reverse image search
This works! We tried it during a hackathon
Similar textual features
i.e. word embeddings
Dual image / text vector embeddings



2. Hi, I’m Eddie. @ejbell


11. Results

12. Results

13. Results

14. Results


25. Performance …?!

26. Other “useful” applications

27. Color Variants

28. Matching Sets

29. Outfits

30. Model Faces


32. thank you @ejbell