SlideShare ist ein Scribd-Unternehmen logo
1 von 48
Hierarchical clustering
in Python & elsewhere
For @PyDataConf London, June 2015, by Frank Kelly
Data Scientist, Engineer @analyticsseo
@norhustla
Hierarchical
Clustering
Theory Practice Visualisation
Origins & definitions
Methods & considerations
Hierachical theory
Metrics & performance
My use case
Python libraries
Example
Static
Interactive
Further ideas
All opinions expressed are my own
Who am I?
All opinions expressed are my own
Attribution: www.alexmaclean.com
Clustering: a recap
Clustering is an unsupervised learning
problem
"SLINK-Gaussian-data" by Chire - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons -
https://commons.wikimedia.org/wiki/File:SLINK-Gaussian-data.svg#/media/File:SLINK-Gaussian-data.svg
based on some
notion of similarity.
whereby we aim to
group subsets of
entities with one
another
Origins
1930s:
Anthropology
&
Psychology
http://dienekes.blogspot.co.uk/2013/12/europeans-neolithic-farmers-mesolithic.html
Diverse applications
Attribution: stack overflow, wikipedia, scikit-learn.org, http://www.poolparty.biz/
Two
main
purposes
Exploratory analysis – standalone tool
(Data mining)
As a component of a supervised learning
pipeline (in which distinct classifiers or
regression models are trained for each
cluster).
(Machine Learning)
Clustering considerations
Partitioning
criteria
(single /
multi level)
Separation
Exclusive /
non-
exclusive
Clustering
space
(Full-space /
sub-space)
Similarity
measure
(distance /
connectivity)
Use case: search keywords
RD
P
P
P
KW
KW
KW
KW
KW
CP
CP
KW
KW
KW
The
competition!
KW
KW
CP
CD
You
Opportunity!
CD = Competing domains
CP = Competitor’s pages
RD = Ranking domain
P = Your page
KW = Keyword
….x 100,000 !!
Use case: search keywords
KW…so we have found 100,000 new ‘s – now what?
How do we summarise and present these to a client?
Clients’ questions…
• Do search categories in general
align with my website structure?
• Which categories of opportunity
keywords have the highest
search volume, bring the most
visitors, revenue etc.?
• Which keywords are not
relevant?
Website-like structure
Requirements
• Need: visual insights;
structure
• Allow targeting of
problem in hand
• May develop into a
semi- supervised
solution
• High-dimensional and sparse
data set
• Values correspond to word
frequencies
• Recommended methods
include: hierarchical
clustering, Kmeans with an
appropriate distance measure,
topic modelling (LDA, LSI),
co-clustering
Options for text
clustering?
Hierarchical Clustering
bringing structure
2 types
Agglomerative
Divisive Deterministic algorithms!
Attribution: Wikipedia
Agglomerative
Start with many
“singleton” clusters
…
Merge 2 at a time
continuously
…
Build a hierarchy
Divisive
Start with a huge “macro”
cluster
…
Iteratively split into 2
groups
…
Build a hierarchy
Agglomerative method:
Linkage types
• Single (similarity between
most similar – based on nearest
neighbour - two elements)
• Complete (similarity between
most dissimilar two elements)
Attribution: https://www.coursera.org/course/clusteranalysis
Agglomerative method:
Linkage types
Average link
( avg. of similarity between
all inter-cluster pairs )
Computationally expensive (Na*Nb)
Trick: Centroid link (similarity
between centroid of two clusters)
Attribution: https://www.coursera.org/course/clusteranalysis
Ward’s criterion
• Minimise a function: total in-cluster variance
• As defined by, e.g.:
• Once merged, then the SSE will increase
(cluster becomes bigger) by:
https://en.wikipedia.org/wiki/Ward's_method
Divisive clustering
• Top-down approach
• Criterion to split: Ward’s criterion
• Handling noise: Use a threshold to determine
the termination criteria
Attribution: https://www.coursera.org/course/clusteranalysis
Similarity measures
This will certainly influence the shape of the
clusters!
• Numerical: Use a variation of the Manhattan
distance (e.g. City block, Euclidean)
• Binary: Manhattan, Jaccard co-efficient,
Hamming
• Text: Cosine similarity.
Cosine similarity
Represent a document by a bag of terms
Record the frequency of a particular term (word/ topic/ phrase)
If d1 and d2 are two term vectors,
…can thus calculate the similarity between them
Attribution: https://www.coursera.org/course/clusteranalysis
Gather word documents = keyword phrases
Aggregate search words with URL “words”
Text clustering:
preparations
• Add features where possible
o I added URL words to my word set
• Stem words
o Choose the right stemmer – too severe can be bad
• Stop words
o NLTK tokeniser
o Scikit learn TF-IDF tokeniser
• Low frequency cut-off
o 2 => words appearing less than twice in whole corpus
• High frequency cut-off
o 0.5 => words that appear in more than 50% of documents
• N-grams
o Single words, bi-grams, tri-grams
• Beware of foreign languages
o Separate datasets if possible
Text preparation
Dimensionality
• Get a sparse matrix
o Mostly zeros
• Reduce the number of dimensions
o PCA
o Spectral clustering
• The “curse” of dimensionality
Results: reduced dimensions
Results: reduced dimensions
The
dendrogram
Assess the quality of your
clusters
• Internal: Purity, completeness & homogeneity
• External: Adjusted Rand index, Normalised
Information index
Topic labelling
Hierarchical Clustering
Beyond Python (!?)
Life on the inside:
Elasticsearch
• Why not perform pre-processing and clustering
inside elasticsearch?
• Document store
• TF-IDF and other
• Stop words
• Language specific analysers
Elasticsearch
- try it ! -
• https://www.elastic.co/
• NoSQL document store
• Aggregations and stats
• Fast, distributed
• Quick to set up
Document storage in ES
Lingo 3G algorithm
• Lingo 3G: Hierarchical clustering off-the-shelf
• Built-in part of speech (POS)
• User-defined word/synonym/label dictionaries
• Built-in stemmer / word inflection database
• Multi-lingual support, advanced tuning
• Commercial: costs attached
http://download.carrotsearch.com/lingo3g/manual/#section.es
http://project.carrot2.org/algorithms.html
Elasticsearch with
clustering – Utopia?
Carrot2’s Lingo3G in action :
http://search.carrot2.org/stable/search
Foamtree visualisation example
Visualisation of hierarchical structure possible for
large datasets via “lazy loading”
http://get.carrotsearch.com/foamtree/demo/demos/large.html
Limitations of hierarchical
clustering
• Can’t undo what’s done (divisive method, work
on sub clusters, cannot re-merge). Even true for
agglomerative (once merged will never split it
again)
• Every split or merge must be refined
• Methods may not scale well, checking all possible
pairs, complexity goes high
There are extensions: BIRCH, CURE and
CHAMELEON
Thank you!
A decent introductory course to clustering;
https://www.coursera.org/course/clusteranalysis
Hierarchical (agglomerative) clustering in Python:
http://scikit-
learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
Recent (ish) relevant Kaggle challenge: https://www.kaggle.com/c/lshtc
Visualisation: http://carrotsearch.com/foamtree-overview
Clustering elsewhere (Lingo, Lingo3G) with
Carrot2:http://download.carrotsearch.com/
Elasticsearch: https://www.elastic.co/
Analytics SEO: http://www.analyticsseo.com/
Me: @norhustla / frank.kelly@cantab.net
Attribution: http://wynway.com/
Extra slide: Why work
inside the database?
1. Sharing data (management of)
Support concurrent access by multiple readers and writers
2. Data Model Enforcement
Make sure all applications see clean, organised data
3. Scale
Work with datasets too large to fit in memory (over a certain size,
need specialised algorithms to deal with the data -> bottleneck)
The database organises and exposes algorithms for you
conveniently
4. Flexibility
Use the data in new, unanticipated ways -> anticipate a broad set
of ways of accessing the data

Weitere ähnliche Inhalte

Was ist angesagt?

Machine Learning with Decision trees
Machine Learning with Decision treesMachine Learning with Decision trees
Machine Learning with Decision treesKnoldus Inc.
 
Knn Algorithm presentation
Knn Algorithm presentationKnn Algorithm presentation
Knn Algorithm presentationRishavSharma112
 
Dimension reduction techniques[Feature Selection]
Dimension reduction techniques[Feature Selection]Dimension reduction techniques[Feature Selection]
Dimension reduction techniques[Feature Selection]AAKANKSHA JAIN
 
Hierarchical clustering
Hierarchical clusteringHierarchical clustering
Hierarchical clusteringishmecse13
 
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...Simplilearn
 
Classification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsClassification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsMd. Main Uddin Rony
 
Big Data Analytics with R
Big Data Analytics with RBig Data Analytics with R
Big Data Analytics with RGreat Wide Open
 
Logistic regression in Machine Learning
Logistic regression in Machine LearningLogistic regression in Machine Learning
Logistic regression in Machine LearningKuppusamy P
 
What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?
What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?
What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?Smarten Augmented Analytics
 
CART – Classification & Regression Trees
CART – Classification & Regression TreesCART – Classification & Regression Trees
CART – Classification & Regression TreesHemant Chetwani
 
Machine learning and linear regression programming
Machine learning and linear regression programmingMachine learning and linear regression programming
Machine learning and linear regression programmingSoumya Mukherjee
 
Unsupervised learning
Unsupervised learningUnsupervised learning
Unsupervised learningamalalhait
 
Customer Segmentation using Clustering
Customer Segmentation using ClusteringCustomer Segmentation using Clustering
Customer Segmentation using ClusteringDessy Amirudin
 

Was ist angesagt? (20)

Machine Learning with Decision trees
Machine Learning with Decision treesMachine Learning with Decision trees
Machine Learning with Decision trees
 
Knn Algorithm presentation
Knn Algorithm presentationKnn Algorithm presentation
Knn Algorithm presentation
 
Dimension reduction techniques[Feature Selection]
Dimension reduction techniques[Feature Selection]Dimension reduction techniques[Feature Selection]
Dimension reduction techniques[Feature Selection]
 
Hierarchical clustering
Hierarchical clusteringHierarchical clustering
Hierarchical clustering
 
Hierarchical clustering
Hierarchical clusteringHierarchical clustering
Hierarchical clustering
 
K - Nearest neighbor ( KNN )
K - Nearest neighbor  ( KNN )K - Nearest neighbor  ( KNN )
K - Nearest neighbor ( KNN )
 
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
 
Hierachical clustering
Hierachical clusteringHierachical clustering
Hierachical clustering
 
Chapter8
Chapter8Chapter8
Chapter8
 
Dbscan algorithom
Dbscan algorithomDbscan algorithom
Dbscan algorithom
 
Classification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsClassification Based Machine Learning Algorithms
Classification Based Machine Learning Algorithms
 
Big Data Analytics with R
Big Data Analytics with RBig Data Analytics with R
Big Data Analytics with R
 
Presentation on K-Means Clustering
Presentation on K-Means ClusteringPresentation on K-Means Clustering
Presentation on K-Means Clustering
 
Logistic regression in Machine Learning
Logistic regression in Machine LearningLogistic regression in Machine Learning
Logistic regression in Machine Learning
 
What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?
What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?
What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?
 
Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree Learning
 
CART – Classification & Regression Trees
CART – Classification & Regression TreesCART – Classification & Regression Trees
CART – Classification & Regression Trees
 
Machine learning and linear regression programming
Machine learning and linear regression programmingMachine learning and linear regression programming
Machine learning and linear regression programming
 
Unsupervised learning
Unsupervised learningUnsupervised learning
Unsupervised learning
 
Customer Segmentation using Clustering
Customer Segmentation using ClusteringCustomer Segmentation using Clustering
Customer Segmentation using Clustering
 

Andere mochten auch

Machine Learning and Data Mining: 08 Clustering: Hierarchical
Machine Learning and Data Mining: 08 Clustering: Hierarchical Machine Learning and Data Mining: 08 Clustering: Hierarchical
Machine Learning and Data Mining: 08 Clustering: Hierarchical Pier Luca Lanzi
 
K-means and Hierarchical Clustering
K-means and Hierarchical ClusteringK-means and Hierarchical Clustering
K-means and Hierarchical Clusteringguestfee8698
 
Text clustering
Text clusteringText clustering
Text clusteringKU Leuven
 
Scaling Document Clustering in the Cloud
Scaling Document Clustering in the CloudScaling Document Clustering in the Cloud
Scaling Document Clustering in the CloudRob Gillen
 
28 Machine Learning Unsupervised Hierarchical Clustering
28 Machine Learning Unsupervised Hierarchical Clustering28 Machine Learning Unsupervised Hierarchical Clustering
28 Machine Learning Unsupervised Hierarchical ClusteringAndres Mendez-Vazquez
 
Weka_Manual_Sagar
Weka_Manual_SagarWeka_Manual_Sagar
Weka_Manual_SagarSagar Kumar
 
Machine Learning and Data Mining: 05 Advanced Association Rule Mining
Machine Learning and Data Mining: 05 Advanced Association Rule MiningMachine Learning and Data Mining: 05 Advanced Association Rule Mining
Machine Learning and Data Mining: 05 Advanced Association Rule MiningPier Luca Lanzi
 
Data mining techniques using weka
Data mining techniques using wekaData mining techniques using weka
Data mining techniques using wekarathorenitin87
 
The Open-Source Monitoring Landscape
The Open-Source Monitoring LandscapeThe Open-Source Monitoring Landscape
The Open-Source Monitoring LandscapeMike Merideth
 
Document Classification and Clustering
Document Classification and ClusteringDocument Classification and Clustering
Document Classification and ClusteringAnkur Shrivastava
 
Document clustering and classification
Document clustering and classification Document clustering and classification
Document clustering and classification Mahmoud Alfarra
 
[Karger+ NIPS11] Iterative Learning for Reliable Crowdsourcing Systems
[Karger+ NIPS11] Iterative Learning for Reliable Crowdsourcing Systems[Karger+ NIPS11] Iterative Learning for Reliable Crowdsourcing Systems
[Karger+ NIPS11] Iterative Learning for Reliable Crowdsourcing SystemsShuyo Nakatani
 
Graph Analyses with Python and NetworkX
Graph Analyses with Python and NetworkXGraph Analyses with Python and NetworkX
Graph Analyses with Python and NetworkXBenjamin Bengfort
 
Extreme Extraction - Machine Reading in a Week
Extreme Extraction - Machine Reading in a WeekExtreme Extraction - Machine Reading in a Week
Extreme Extraction - Machine Reading in a WeekShuyo Nakatani
 

Andere mochten auch (20)

Machine Learning and Data Mining: 08 Clustering: Hierarchical
Machine Learning and Data Mining: 08 Clustering: Hierarchical Machine Learning and Data Mining: 08 Clustering: Hierarchical
Machine Learning and Data Mining: 08 Clustering: Hierarchical
 
K-means and Hierarchical Clustering
K-means and Hierarchical ClusteringK-means and Hierarchical Clustering
K-means and Hierarchical Clustering
 
Text clustering
Text clusteringText clustering
Text clustering
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
08 clustering
08 clustering08 clustering
08 clustering
 
Cluster Analysis
Cluster AnalysisCluster Analysis
Cluster Analysis
 
NLTK in 20 minutes
NLTK in 20 minutesNLTK in 20 minutes
NLTK in 20 minutes
 
Scaling Document Clustering in the Cloud
Scaling Document Clustering in the CloudScaling Document Clustering in the Cloud
Scaling Document Clustering in the Cloud
 
Alz Hack II
Alz Hack IIAlz Hack II
Alz Hack II
 
28 Machine Learning Unsupervised Hierarchical Clustering
28 Machine Learning Unsupervised Hierarchical Clustering28 Machine Learning Unsupervised Hierarchical Clustering
28 Machine Learning Unsupervised Hierarchical Clustering
 
Weka_Manual_Sagar
Weka_Manual_SagarWeka_Manual_Sagar
Weka_Manual_Sagar
 
Machine Learning and Data Mining: 05 Advanced Association Rule Mining
Machine Learning and Data Mining: 05 Advanced Association Rule MiningMachine Learning and Data Mining: 05 Advanced Association Rule Mining
Machine Learning and Data Mining: 05 Advanced Association Rule Mining
 
Data mining techniques using weka
Data mining techniques using wekaData mining techniques using weka
Data mining techniques using weka
 
The Open-Source Monitoring Landscape
The Open-Source Monitoring LandscapeThe Open-Source Monitoring Landscape
The Open-Source Monitoring Landscape
 
Data Mining using Weka
Data Mining using WekaData Mining using Weka
Data Mining using Weka
 
Document Classification and Clustering
Document Classification and ClusteringDocument Classification and Clustering
Document Classification and Clustering
 
Document clustering and classification
Document clustering and classification Document clustering and classification
Document clustering and classification
 
[Karger+ NIPS11] Iterative Learning for Reliable Crowdsourcing Systems
[Karger+ NIPS11] Iterative Learning for Reliable Crowdsourcing Systems[Karger+ NIPS11] Iterative Learning for Reliable Crowdsourcing Systems
[Karger+ NIPS11] Iterative Learning for Reliable Crowdsourcing Systems
 
Graph Analyses with Python and NetworkX
Graph Analyses with Python and NetworkXGraph Analyses with Python and NetworkX
Graph Analyses with Python and NetworkX
 
Extreme Extraction - Machine Reading in a Week
Extreme Extraction - Machine Reading in a WeekExtreme Extraction - Machine Reading in a Week
Extreme Extraction - Machine Reading in a Week
 

Ähnlich wie Hierarchical clustering in Python and beyond

clustering_classification.ppt
clustering_classification.pptclustering_classification.ppt
clustering_classification.pptHODECE21
 
Knowledge graph construction with a façade - The SPARQL Anything Project
Knowledge graph construction with a façade - The SPARQL Anything ProjectKnowledge graph construction with a façade - The SPARQL Anything Project
Knowledge graph construction with a façade - The SPARQL Anything ProjectEnrico Daga
 
AAT LOD Microthesauri
AAT LOD MicrothesauriAAT LOD Microthesauri
AAT LOD MicrothesauriMarcia Zeng
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
Metadata first, ontologies second
Metadata first, ontologies secondMetadata first, ontologies second
Metadata first, ontologies secondJoseba Abaitua
 
DITA's New Thang: Going Mapless!
DITA's New Thang: Going Mapless!DITA's New Thang: Going Mapless!
DITA's New Thang: Going Mapless!dclsocialmedia
 
Linking Folksonomies to Knowledge Organization Systems
Linking Folksonomies to Knowledge Organization SystemsLinking Folksonomies to Knowledge Organization Systems
Linking Folksonomies to Knowledge Organization SystemsJakob .
 
The Rhetoric of Research Objects
The Rhetoric of Research ObjectsThe Rhetoric of Research Objects
The Rhetoric of Research ObjectsCarole Goble
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningRahul Jain
 
Collaborations in the Extreme: 
The rise of open code development in the scie...
Collaborations in the Extreme: 
The rise of open code development in the scie...Collaborations in the Extreme: 
The rise of open code development in the scie...
Collaborations in the Extreme: 
The rise of open code development in the scie...Kelle Cruz
 
Research Objects for improved sharing and reproducibility
Research Objects for improved sharing and reproducibilityResearch Objects for improved sharing and reproducibility
Research Objects for improved sharing and reproducibilityOscar Corcho
 
Recommending Semantic Nearest Neighbors Using Storm and Dato
Recommending Semantic Nearest Neighbors Using Storm and DatoRecommending Semantic Nearest Neighbors Using Storm and Dato
Recommending Semantic Nearest Neighbors Using Storm and DatoAshok Venkatesan
 
Open Archives Initiative Object Reuse and Exchange
Open Archives Initiative Object Reuse and ExchangeOpen Archives Initiative Object Reuse and Exchange
Open Archives Initiative Object Reuse and Exchangelagoze
 
Reduce Query Time Up to 60% with Selective Search
Reduce Query Time Up to 60% with Selective SearchReduce Query Time Up to 60% with Selective Search
Reduce Query Time Up to 60% with Selective SearchLucidworks
 

Ähnlich wie Hierarchical clustering in Python and beyond (20)

04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
 
clustering_classification.ppt
clustering_classification.pptclustering_classification.ppt
clustering_classification.ppt
 
Knowledge graph construction with a façade - The SPARQL Anything Project
Knowledge graph construction with a façade - The SPARQL Anything ProjectKnowledge graph construction with a façade - The SPARQL Anything Project
Knowledge graph construction with a façade - The SPARQL Anything Project
 
AAT LOD Microthesauri
AAT LOD MicrothesauriAAT LOD Microthesauri
AAT LOD Microthesauri
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
Metadata first, ontologies second
Metadata first, ontologies secondMetadata first, ontologies second
Metadata first, ontologies second
 
Elastic pivorak
Elastic pivorakElastic pivorak
Elastic pivorak
 
DITA's New Thang: Going Mapless!
DITA's New Thang: Going Mapless!DITA's New Thang: Going Mapless!
DITA's New Thang: Going Mapless!
 
A Clean Slate?
A Clean Slate?A Clean Slate?
A Clean Slate?
 
Linking Folksonomies to Knowledge Organization Systems
Linking Folksonomies to Knowledge Organization SystemsLinking Folksonomies to Knowledge Organization Systems
Linking Folksonomies to Knowledge Organization Systems
 
The Rhetoric of Research Objects
The Rhetoric of Research ObjectsThe Rhetoric of Research Objects
The Rhetoric of Research Objects
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Collaborations in the Extreme: 
The rise of open code development in the scie...
Collaborations in the Extreme: 
The rise of open code development in the scie...Collaborations in the Extreme: 
The rise of open code development in the scie...
Collaborations in the Extreme: 
The rise of open code development in the scie...
 
Research Objects for improved sharing and reproducibility
Research Objects for improved sharing and reproducibilityResearch Objects for improved sharing and reproducibility
Research Objects for improved sharing and reproducibility
 
Recommender Systems and Linked Open Data
Recommender Systems and Linked Open DataRecommender Systems and Linked Open Data
Recommender Systems and Linked Open Data
 
Recommending Semantic Nearest Neighbors Using Storm and Dato
Recommending Semantic Nearest Neighbors Using Storm and DatoRecommending Semantic Nearest Neighbors Using Storm and Dato
Recommending Semantic Nearest Neighbors Using Storm and Dato
 
UCIAD overview
UCIAD overviewUCIAD overview
UCIAD overview
 
Open Archives Initiative Object Reuse and Exchange
Open Archives Initiative Object Reuse and ExchangeOpen Archives Initiative Object Reuse and Exchange
Open Archives Initiative Object Reuse and Exchange
 
Reduce Query Time Up to 60% with Selective Search
Reduce Query Time Up to 60% with Selective SearchReduce Query Time Up to 60% with Selective Search
Reduce Query Time Up to 60% with Selective Search
 
Keynote at AImWD
Keynote at AImWDKeynote at AImWD
Keynote at AImWD
 

Kürzlich hochgeladen

Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsJoseMangaJr1
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
hybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptxhybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptx9to5mart
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...amitlee9823
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...amitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...amitlee9823
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachBoston Institute of Analytics
 

Kürzlich hochgeladen (20)

Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
hybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptxhybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptx
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 

Hierarchical clustering in Python and beyond

  • 1. Hierarchical clustering in Python & elsewhere For @PyDataConf London, June 2015, by Frank Kelly Data Scientist, Engineer @analyticsseo @norhustla
  • 2. Hierarchical Clustering Theory Practice Visualisation Origins & definitions Methods & considerations Hierachical theory Metrics & performance My use case Python libraries Example Static Interactive Further ideas All opinions expressed are my own
  • 3. Who am I? All opinions expressed are my own
  • 5. Clustering is an unsupervised learning problem "SLINK-Gaussian-data" by Chire - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:SLINK-Gaussian-data.svg#/media/File:SLINK-Gaussian-data.svg based on some notion of similarity. whereby we aim to group subsets of entities with one another
  • 7. Diverse applications Attribution: stack overflow, wikipedia, scikit-learn.org, http://www.poolparty.biz/
  • 8. Two main purposes Exploratory analysis – standalone tool (Data mining) As a component of a supervised learning pipeline (in which distinct classifiers or regression models are trained for each cluster). (Machine Learning)
  • 9. Clustering considerations Partitioning criteria (single / multi level) Separation Exclusive / non- exclusive Clustering space (Full-space / sub-space) Similarity measure (distance / connectivity)
  • 10. Use case: search keywords RD P P P KW KW KW KW KW CP CP KW KW KW The competition! KW KW CP CD You Opportunity! CD = Competing domains CP = Competitor’s pages RD = Ranking domain P = Your page KW = Keyword
  • 12. Use case: search keywords KW…so we have found 100,000 new ‘s – now what? How do we summarise and present these to a client?
  • 13. Clients’ questions… • Do search categories in general align with my website structure? • Which categories of opportunity keywords have the highest search volume, bring the most visitors, revenue etc.? • Which keywords are not relevant?
  • 15. Requirements • Need: visual insights; structure • Allow targeting of problem in hand • May develop into a semi- supervised solution
  • 16. • High-dimensional and sparse data set • Values correspond to word frequencies • Recommended methods include: hierarchical clustering, Kmeans with an appropriate distance measure, topic modelling (LDA, LSI), co-clustering Options for text clustering?
  • 18. 2 types Agglomerative Divisive Deterministic algorithms! Attribution: Wikipedia
  • 19. Agglomerative Start with many “singleton” clusters … Merge 2 at a time continuously … Build a hierarchy Divisive Start with a huge “macro” cluster … Iteratively split into 2 groups … Build a hierarchy
  • 20. Agglomerative method: Linkage types • Single (similarity between most similar – based on nearest neighbour - two elements) • Complete (similarity between most dissimilar two elements) Attribution: https://www.coursera.org/course/clusteranalysis
  • 21.
  • 22.
  • 23. Agglomerative method: Linkage types Average link ( avg. of similarity between all inter-cluster pairs ) Computationally expensive (Na*Nb) Trick: Centroid link (similarity between centroid of two clusters) Attribution: https://www.coursera.org/course/clusteranalysis
  • 24. Ward’s criterion • Minimise a function: total in-cluster variance • As defined by, e.g.: • Once merged, then the SSE will increase (cluster becomes bigger) by: https://en.wikipedia.org/wiki/Ward's_method
  • 25. Divisive clustering • Top-down approach • Criterion to split: Ward’s criterion • Handling noise: Use a threshold to determine the termination criteria Attribution: https://www.coursera.org/course/clusteranalysis
  • 26. Similarity measures This will certainly influence the shape of the clusters! • Numerical: Use a variation of the Manhattan distance (e.g. City block, Euclidean) • Binary: Manhattan, Jaccard co-efficient, Hamming • Text: Cosine similarity.
  • 27. Cosine similarity Represent a document by a bag of terms Record the frequency of a particular term (word/ topic/ phrase) If d1 and d2 are two term vectors, …can thus calculate the similarity between them Attribution: https://www.coursera.org/course/clusteranalysis
  • 28.
  • 29. Gather word documents = keyword phrases
  • 30. Aggregate search words with URL “words”
  • 31. Text clustering: preparations • Add features where possible o I added URL words to my word set • Stem words o Choose the right stemmer – too severe can be bad • Stop words o NLTK tokeniser o Scikit learn TF-IDF tokeniser • Low frequency cut-off o 2 => words appearing less than twice in whole corpus • High frequency cut-off o 0.5 => words that appear in more than 50% of documents • N-grams o Single words, bi-grams, tri-grams • Beware of foreign languages o Separate datasets if possible
  • 33. Dimensionality • Get a sparse matrix o Mostly zeros • Reduce the number of dimensions o PCA o Spectral clustering • The “curse” of dimensionality
  • 34.
  • 38. Assess the quality of your clusters • Internal: Purity, completeness & homogeneity • External: Adjusted Rand index, Normalised Information index
  • 41. Life on the inside: Elasticsearch • Why not perform pre-processing and clustering inside elasticsearch? • Document store • TF-IDF and other • Stop words • Language specific analysers
  • 42. Elasticsearch - try it ! - • https://www.elastic.co/ • NoSQL document store • Aggregations and stats • Fast, distributed • Quick to set up
  • 44. Lingo 3G algorithm • Lingo 3G: Hierarchical clustering off-the-shelf • Built-in part of speech (POS) • User-defined word/synonym/label dictionaries • Built-in stemmer / word inflection database • Multi-lingual support, advanced tuning • Commercial: costs attached http://download.carrotsearch.com/lingo3g/manual/#section.es http://project.carrot2.org/algorithms.html
  • 45. Elasticsearch with clustering – Utopia? Carrot2’s Lingo3G in action : http://search.carrot2.org/stable/search Foamtree visualisation example Visualisation of hierarchical structure possible for large datasets via “lazy loading” http://get.carrotsearch.com/foamtree/demo/demos/large.html
  • 46. Limitations of hierarchical clustering • Can’t undo what’s done (divisive method, work on sub clusters, cannot re-merge). Even true for agglomerative (once merged will never split it again) • Every split or merge must be refined • Methods may not scale well, checking all possible pairs, complexity goes high There are extensions: BIRCH, CURE and CHAMELEON
  • 47. Thank you! A decent introductory course to clustering; https://www.coursera.org/course/clusteranalysis Hierarchical (agglomerative) clustering in Python: http://scikit- learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html Recent (ish) relevant Kaggle challenge: https://www.kaggle.com/c/lshtc Visualisation: http://carrotsearch.com/foamtree-overview Clustering elsewhere (Lingo, Lingo3G) with Carrot2:http://download.carrotsearch.com/ Elasticsearch: https://www.elastic.co/ Analytics SEO: http://www.analyticsseo.com/ Me: @norhustla / frank.kelly@cantab.net Attribution: http://wynway.com/
  • 48. Extra slide: Why work inside the database? 1. Sharing data (management of) Support concurrent access by multiple readers and writers 2. Data Model Enforcement Make sure all applications see clean, organised data 3. Scale Work with datasets too large to fit in memory (over a certain size, need specialised algorithms to deal with the data -> bottleneck) The database organises and exposes algorithms for you conveniently 4. Flexibility Use the data in new, unanticipated ways -> anticipate a broad set of ways of accessing the data

Hinweis der Redaktion

  1. http://dienekes.blogspot.co.uk/2013/12/europeans-neolithic-farmers-mesolithic.html
  2. https://upload.wikimedia.org/wikipedia/commons/3/39/Swiss_complete.png