Data Analysis Course
Cluster Analysis
Venkat Reddy
Contents
• What is the need of Segmentation
• Introduction to Segmentation & Cluster analysis
• Applications of Cluster Analysis
• Types of Clusters
• K-Means clustering
DataAnalysisCourse
VenkatReddy
What is the need of segmentation?
Problem:
• 10,000 customers: we know their age, city, income,
employment status and designation
• You have to sell 100 Blackberry phones (each costs $1000) to
the people in this group, and you have a maximum of 7 days
• If you give a demo to each individual, 10,000 demos will
take more than one year. How do you sell the maximum number
of phones by giving the minimum number of demos?
What is the need of segmentation?
Solution
• Divide the whole population into two groups: employed / unemployed
• Further divide the employed population into two groups: high / low salary
• Further divide the high-salary group into two groups: high / low designation
10,000 customers
  Unemployed: 3,000
  Employed: 7,000
    Low salary: 5,000
    High salary: 2,000
      Low designation: 1,800
      High designation: 200
Segmentation and Cluster Analysis
• A cluster is a group of similar objects (cases, points, observations,
examples, members, customers, patients, locations, etc.)
• Cluster analysis finds groups of cases/observations/objects in the
population such that the objects are
  • Homogeneous within the group (high intra-class similarity)
  • Heterogeneous between the groups (low inter-class similarity)
[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]
Applications of Cluster Analysis
• Market Segmentation: Grouping people (with the willingness,
purchasing power, and the authority to buy) according to their
similarity in several dimensions related to a product under
consideration.
• Sales Segmentation: Clustering can tell you what types of customers
buy what products
• Credit Risk: Segmentation of customers based on their credit history
• Operations: High performer segmentation & promotions based on
person’s performance
• Insurance: Identifying groups of motor insurance policy holders with
a high average claim cost.
• City-planning: Identifying groups of houses according to their house
type, value, and geographical location
• Geographical: Identification of areas of similar land use in an earth
observation database.
Types of Clusters
• Partitional (non-hierarchical) clustering: a division
of objects into non-overlapping subsets (clusters) such
that each object is in exactly one cluster
  • The non-hierarchical methods divide a dataset of N
  objects into M clusters
  • K-means clustering, a non-hierarchical technique, is
  the most commonly used one in business analytics
• Hierarchical clustering: a set of nested clusters
organized as a hierarchical tree
  • The hierarchical methods produce a set of nested
  clusters in which each pair of objects or clusters is
  progressively nested in a larger cluster until only one
  cluster remains
  • The CHAID tree is the most widely used one in business analytics
Cluster Analysis - Example
Student scores:
            Maths  Science   Gk   Apt
Student-1     94      82     87    89
Student-2     46      67     33    72
Student-3     98      97     93   100
Student-4     14       5      7    24
Student-5     86      97     95    95
Student-6     34      32     75    66
Student-7     69      44     59    55
Student-8     85      90     96    89
Student-9     24      26     15    22
The same students sorted by Maths score:
            Maths  Science   Gk   Apt
Student-4     14       5      7    24
Student-9     24      26     15    22
Student-6     34      32     75    66
Student-2     46      67     33    72
Student-7     69      44     59    55
Student-8     85      90     96    89
Student-5     86      97     95    95
Student-1     94      82     87    89
Student-3     98      97     93   100
Groups marked on the sorted table:
• Students 4, 9, 6
• Students 2, 7
• Students 8, 5, 1, 3
Building Clusters
1. Select a distance measure
2. Select a clustering algorithm
3. Define the distance between two clusters
4. Determine the number of clusters
5. Validate the analysis
• The aim is to build clusters, i.e. divide the whole population into groups of
similar objects
• What is similarity/dissimilarity?
• How do you define the distance between two clusters?
Dissimilarity & Similarity
Which two customers are similar?
         Weight
Cust1      68
Cust2      72
Cust3     100
Which two customers are similar now?
         Weight   Age
Cust1      68     25
Cust2      72     70
Cust3     100     28
Which two customers are similar in this case?
         Weight   Age   Income
Cust1      68     25    60,000
Cust2      72     70     9,000
Cust3     100     28    62,000
Quantify dissimilarity: Distance measures
• To measure similarity between two observations, a
distance measure is needed. With a single variable,
similarity is straightforward
  • Example: income. Two individuals are similar if their income
  levels are similar, and the level of dissimilarity increases as the
  income gap increases
• Multiple variables require an aggregate distance
measure
  • With many characteristics (e.g. income, age, consumption habits,
  family composition, owning a car, education level, job…), it
  becomes more difficult to define similarity with a single value
  • The best-known measure of distance is the Euclidean
  distance, which is the concept we use in everyday life for
  spatial coordinates
Examples of distances
Euclidean distance:
$$D_{ij} = \sqrt{\sum_{k=1}^{n} (x_{ki} - x_{kj})^2}$$
City-block (Manhattan) distance:
$$D_{ij} = \sum_{k=1}^{n} |x_{ki} - x_{kj}|$$
[Figure: the two distances illustrated between points A and B]
Here D_ij is the distance between cases i and j, and x_kj is the value of variable x_k for case j.
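As a rough illustration of the Euclidean and city-block definitions, the following Python sketch computes both distances for one pair of cases. The function names and the two sample points are made up for this example and are not part of the course material.

```python
import math

def euclidean(case_i, case_j):
    """Square root of the sum of squared differences over all variables."""
    return math.sqrt(sum((xi - xj) ** 2 for xi, xj in zip(case_i, case_j)))

def manhattan(case_i, case_j):
    """Sum of absolute differences over all variables (city-block distance)."""
    return sum(abs(xi - xj) for xi, xj in zip(case_i, case_j))

# Two hypothetical cases measured on two variables.
a = (0, 0)
b = (3, 4)

print(euclidean(a, b))  # 5.0 (straight-line distance)
print(manhattan(a, b))  # 7   (distance along the axes)
```

The Manhattan distance is never smaller than the Euclidean distance for the same pair, because it adds the per-variable gaps instead of taking the straight line.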
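Other distance measures: Chebyshev, Minkowski, Mahalanobis,
maximum distance, cosine similarity, simple correlation between
observations, etc.
Data matrix (n objects, p variables):
$$\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}$$
Dissimilarity matrix:
$$\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots &        &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}$$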
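A minimal Python sketch of how a data matrix turns into a dissimilarity matrix, assuming Euclidean distance on the raw values. The rows reuse the three customers (weight, age, income) from the earlier similarity slide; the function name `dissimilarity_matrix` is an assumption made for this example.

```python
import math

# Data matrix: one row per customer, columns = weight, age, income
# (values taken from the three-customer table earlier in the deck).
data = [
    (68, 25, 60000),   # Cust1
    (72, 70, 9000),    # Cust2
    (100, 28, 62000),  # Cust3
]

def dissimilarity_matrix(rows):
    """Return the symmetric n x n matrix of Euclidean distances d(i, j)."""
    n = len(rows)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):  # fill the lower triangle and mirror it
            d[i][j] = d[j][i] = math.dist(rows[i], rows[j])
    return d

d = dissimilarity_matrix(data)
for row in d:
    print([round(value, 2) for value in row])
```

On these raw values the income column dominates: Cust1 and Cust3 are about 2,000 apart, while every distance involving Cust2 exceeds 50,000, almost regardless of weight and age.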
Calculating the distance
With one variable:
         Weight
Cust1      68
Cust2      72
Cust3     100
• Cust1 vs Cust2: |68 - 72| = 4
• Cust2 vs Cust3: |72 - 100| = 28
• Cust3 vs Cust1: |100 - 68| = 32
With two variables:
         Weight   Age
Cust1      68     25
Cust2      72     70
Cust3     100     28
• Cust1 vs Cust2: sqrt((68-72)^2 + (25-70)^2) = 45.18
• Cust2 vs Cust3: sqrt((72-100)^2 + (70-28)^2) = 50.48
• Cust3 vs Cust1: sqrt((100-68)^2 + (28-25)^2) = 32.14
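The pairwise distances for these three customers can be checked with a few lines of Python. This is only a sketch using the standard library; the variable names are chosen for the example. Computed directly and rounded to two decimals, the two-variable distances come out as roughly 45.18, 50.48 and 32.14.

```python
import math

# (weight, age) for each customer, as in the table above.
customers = {"Cust1": (68, 25), "Cust2": (72, 70), "Cust3": (100, 28)}
pairs = [("Cust1", "Cust2"), ("Cust2", "Cust3"), ("Cust3", "Cust1")]

# One variable (weight only): the distance is the absolute difference.
weight_only = {
    (a, b): abs(customers[a][0] - customers[b][0]) for a, b in pairs
}

# Two variables (weight and age): Euclidean distance.
weight_age = {
    (a, b): round(math.dist(customers[a], customers[b]), 2) for a, b in pairs
}

for pair in pairs:
    print(pair, weight_only[pair], weight_age[pair])
```

With weight alone, Cust1 and Cust2 are the closest pair; once age is added, Cust1 and Cust3 become the closest pair.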
Demo: Calculation of distance
proc distance data=cust_data out=Dist method=Euclid nostd;
var interval(Credit_score Expenses);
run;
proc print data=Dist;
run;
Lab: Distance Calculation
proc distance data=cust_data out=Count_Dist method=Euclid
nostd;
var interval(Area_Sq_Miles_ GDP_MM_ Unemp_rate);
run;
proc print data=Count_Dist;
run;
Clustering algorithms
• K-means clustering algorithm
• Fuzzy c-means clustering algorithm
• Hierarchical clustering algorithm
• Gaussian (EM) clustering algorithm
• Quality Threshold (QT) clustering algorithm
• MST-based clustering algorithm
• Density-based clustering algorithm
• Kernel k-means clustering algorithm
K-Means Clustering – Algorithm
1. The number k of clusters is fixed
2. An initial set of k “seeds” (aggregation centres) is provided
   1. First k elements
   2. Other seeds (randomly selected or explicitly defined)
3. Given a certain fixed threshold, all units are assigned to the
nearest cluster seed
4. New seeds are computed
5. Go back to step 3 until no reclassification is necessary
Or simply:
Initialize k cluster centers
Do
    Assignment step: assign each data point to its closest cluster center
    Re-estimation step: re-compute cluster centers
While (there are still changes in the cluster centers)
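The loop above can be sketched in plain Python as follows. This is an illustrative implementation, not the SAS procedure used later in the deck: the function name `k_means`, the Euclidean distance, and the explicitly chosen seeds (Student-4, Student-7 and Student-3 from the student scores example) are all assumptions made for this example.

```python
import math

# Maths, Science, Gk, Apt scores for Student-1 .. Student-9.
scores = [
    (94, 82, 87, 89), (46, 67, 33, 72), (98, 97, 93, 100),
    (14, 5, 7, 24), (86, 97, 95, 95), (34, 32, 75, 66),
    (69, 44, 59, 55), (85, 90, 96, 89), (24, 26, 15, 22),
]

def k_means(points, seeds, max_iter=20):
    """Alternate assignment and re-estimation until assignments stop changing."""
    centers = [tuple(seed) for seed in seeds]
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the closest center for every point.
        new_labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # no reclassification, so stop
            break
        labels = new_labels
        # Re-estimation step: each center becomes the mean of its members.
        for c in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers

# Explicitly defined seeds: Student-4, Student-7, Student-3 (an arbitrary choice).
labels, centers = k_means(scores, seeds=[scores[3], scores[6], scores[2]])
for student, label in enumerate(labels, start=1):
    print(f"Student-{student}: cluster {label}")
```

With these seeds the run settles after two passes into students 4 and 9, students 2, 6 and 7, and students 1, 3, 5 and 8. Different seeds or a different distance measure can give a different grouping.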
K-Means clustering
[Figure sequence: one k-means run shown step by step]
1. Overall population
2. Fix the number of clusters
3. Calculate the distance of each case from all clusters
4. Assign each case to the nearest cluster
5. Recalculate the cluster centers
6. Reassign after changing the cluster centers
7. Continue till there is no significant change between two iterations
K-Means clustering in action
• Dividing the data into 10 clusters using K-Means
[Figure: data divided into 10 clusters; annotation: the distance metric will decide the cluster for these points]
K-Means Clustering SAS Demo
proc fastclus data= sup_market radius=0 replace=full
maxclusters =5 maxiter =20 distance out=clustr_out;
id cust_id;
Var age family_size income spend visit_Other_shops;
run;
• A supermarket wants to send promotional coupons to 100
families
• The idea is to identify 100 customers with medium income and low
recent spend
Lab: K-Means Clustering
• Download the contact center agents data
• The performance data contains
  • Average handling time
  • Average number of calls
  • CSAT
  • Resolution score
• Identify the top 10 agents for promotion based on the criteria below
  • High CSAT
  • High resolution score
  • Low average handling time
  • High number of calls
SAS Code Options
• The RADIUS= option establishes the minimum distance criterion for
selecting new seeds. No observation is considered as a new seed unless its
minimum distance to previous seeds exceeds the value given by the
RADIUS= option. The default value is 0.
• The MAXCLUSTERS= option specifies the maximum number of clusters
allowed. If you omit the MAXCLUSTERS= option, a value of 100 is assumed.
• The REPLACE= option specifies how seed replacement is performed.
  • FULL: requests default seed replacement.
  • PART: requests seed replacement only when the distance between the
  observation and the closest seed is greater than the minimum distance between
  seeds.
  • NONE: suppresses seed replacement.
  • RANDOM: selects a simple pseudo-random sample of complete observations as
  initial cluster seeds.
SAS Code & Options
• The MAXITER= option specifies the maximum number of iterations for
recomputing cluster seeds. When the value of the MAXITER= option is greater
than 0, each observation is assigned to the nearest seed, and the seeds are
recomputed as the means of the clusters.
• The LIST option lists all observations, giving the value of the ID variable (if
any), the number of the cluster to which the observation is assigned, and
the distance between the observation and the final cluster seed.
• The DISTANCE option computes distances between the cluster means.
• The ID variable, which can be character or numeric, identifies observations
on the output when you specify the LIST option.
• The VAR statement lists the numeric variables to be used in the cluster
analysis. If you omit the VAR statement, all numeric variables not listed in
other statements are used.
Distance between Clusters
In the formulas below, tip is an element of cluster Ki and tjq is an element of cluster Kj.
• Single link: smallest distance between an element in one cluster and an
element in the other, i.e., dist(Ki, Kj) = min dist(tip, tjq)
• Complete link: largest distance between an element in one cluster and an
element in the other, i.e., dist(Ki, Kj) = max dist(tip, tjq)
• Average: average distance between an element in one cluster and an element in
the other, i.e., dist(Ki, Kj) = avg dist(tip, tjq)
• Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) =
dist(Ci, Cj)
• Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) =
dist(Mi, Mj), where a medoid is a chosen, centrally located object in the cluster
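To make the first four definitions concrete, here is a small Python sketch that computes them for two tiny clusters. The cluster points and function names are hypothetical, chosen so the results are round numbers; the medoid variant is left out because it depends on how the medoid is chosen.

```python
import math
from itertools import product
from statistics import mean

def pairwise(cluster_i, cluster_j):
    """All distances between an element of one cluster and an element of the other."""
    return [math.dist(p, q) for p, q in product(cluster_i, cluster_j)]

def single_link(cluster_i, cluster_j):
    return min(pairwise(cluster_i, cluster_j))

def complete_link(cluster_i, cluster_j):
    return max(pairwise(cluster_i, cluster_j))

def average_link(cluster_i, cluster_j):
    return mean(pairwise(cluster_i, cluster_j))

def centroid(cluster):
    """Per-variable mean of the cluster's points."""
    return tuple(mean(col) for col in zip(*cluster))

def centroid_link(cluster_i, cluster_j):
    return math.dist(centroid(cluster_i), centroid(cluster_j))

# Two hypothetical clusters in two dimensions.
k_i = [(0, 0), (0, 4)]
k_j = [(3, 0), (3, 4)]

print(single_link(k_i, k_j))    # 3.0
print(complete_link(k_i, k_j))  # 5.0
print(average_link(k_i, k_j))   # 4.0
print(centroid_link(k_i, k_j))  # 3.0
```

The same pair of clusters is 3, 4 or 5 units apart depending on the definition.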
SAS output interpretation
• RMSSTD: pooled standard deviation of all the variables forming the
cluster (variance within a cluster).
  • Since the objective of cluster analysis is to form homogeneous groups,
  the RMSSTD of a cluster should be as small as possible
• SPRSQ: semipartial R-squared measures the loss of homogeneity due to
combining two groups or clusters to form a new group or cluster (the error
incurred by combining two groups).
  • Thus, the SPRSQ value should be small, implying that we are merging two
  homogeneous groups
SAS output interpretation
• RSQ (R-squared) measures the extent to which groups or clusters
are different from each other (variance between the clusters).
  • So, when you have just one cluster, the RSQ value is, intuitively, zero.
  Thus, the RSQ value should be high.
• Centroid Distance is simply the Euclidean distance between the
centroids of the two clusters that are to be joined or merged.
  • So, Centroid Distance is a measure of the homogeneity of merged
  clusters, and the value should be small.
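A small Python sketch of the idea behind RSQ, assuming the common definition of R-squared as 1 minus the within-cluster sum of squares divided by the total sum of squares. The function names are made up for this example, the data are the three customer weights used earlier in the deck, and the exact figure SAS prints may be computed differently.

```python
def sum_of_squares(points):
    """Sum of squared deviations of the points from their own centroid."""
    centroid = [sum(col) / len(points) for col in zip(*points)]
    return sum((x - c) ** 2 for p in points for x, c in zip(p, centroid))

def r_squared(clusters):
    """1 - (within-cluster sum of squares / total sum of squares)."""
    all_points = [p for cluster in clusters for p in cluster]
    total = sum_of_squares(all_points)
    within = sum(sum_of_squares(cluster) for cluster in clusters)
    return 1 - within / total

# Customer weights 68, 72 and 100 as one-variable points.
one_cluster = [[(68,), (72,), (100,)]]
two_clusters = [[(68,), (72,)], [(100,)]]

print(r_squared(one_cluster))             # 0.0
print(round(r_squared(two_clusters), 3))  # 0.987
```

With everything in one cluster the value is zero; splitting off the heaviest customer explains almost all of the variation, so the value moves close to one.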
Distance Calculation on standardized data
Raw data:
          Weight   Income
Cust1       68     60,000
Cust2       72      9,000
Cust3      100     62,000
Average     80     43,667
Stdev       14     24,527
Standardized data:
          Weight   Income
Cust1     -0.84     0.67
Cust2     -0.56    -1.41
Cust3      1.40     0.75
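The standardized values can be reproduced with a short Python sketch. It assumes the population standard deviation (dividing by n rather than n - 1), which is what matches the Stdev row of 14 and 24,527; the function name `z_scores` is chosen for this example.

```python
from statistics import mean, pstdev

weight = [68, 72, 100]
income = [60000, 9000, 62000]

def z_scores(values):
    """Subtract the mean and divide by the population standard deviation."""
    m = mean(values)
    s = pstdev(values)
    return [(v - m) / s for v in values]

z_weight = [round(z, 2) for z in z_scores(weight)]
z_income = [round(z, 2) for z in z_scores(income)]

print(z_weight)  # [-0.84, -0.56, 1.4]
print(z_income)  # [0.67, -1.41, 0.75]
```

After standardization both variables are on the same scale, so income no longer dominates a distance computed across the two columns.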

Drug Information Services- DIC and Sources.
 
Philosophy of Education and Educational Philosophy
Philosophy of Education  and Educational PhilosophyPhilosophy of Education  and Educational Philosophy
Philosophy of Education and Educational Philosophy
 
NOTES OF DRUGS ACTING ON NERVOUS SYSTEM .pdf
NOTES OF DRUGS ACTING ON NERVOUS SYSTEM .pdfNOTES OF DRUGS ACTING ON NERVOUS SYSTEM .pdf
NOTES OF DRUGS ACTING ON NERVOUS SYSTEM .pdf
 
Easter in the USA presentation by Chloe.
Easter in the USA presentation by Chloe.Easter in the USA presentation by Chloe.
Easter in the USA presentation by Chloe.
 
Diploma in Nursing Admission Test Question Solution 2023.pdf
Diploma in Nursing Admission Test Question Solution 2023.pdfDiploma in Nursing Admission Test Question Solution 2023.pdf
Diploma in Nursing Admission Test Question Solution 2023.pdf
 
UKCGE Parental Leave Discussion March 2024
UKCGE Parental Leave Discussion March 2024UKCGE Parental Leave Discussion March 2024
UKCGE Parental Leave Discussion March 2024
 
CAULIFLOWER BREEDING 1 Parmar pptx
CAULIFLOWER BREEDING 1 Parmar pptxCAULIFLOWER BREEDING 1 Parmar pptx
CAULIFLOWER BREEDING 1 Parmar pptx
 
Personal Resilience in Project Management 2 - TV Edit 1a.pdf
Personal Resilience in Project Management 2 - TV Edit 1a.pdfPersonal Resilience in Project Management 2 - TV Edit 1a.pdf
Personal Resilience in Project Management 2 - TV Edit 1a.pdf
 
2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx
2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx
2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx
 
Ultra structure and life cycle of Plasmodium.pptx
Ultra structure and life cycle of Plasmodium.pptxUltra structure and life cycle of Plasmodium.pptx
Ultra structure and life cycle of Plasmodium.pptx
 
Latin American Revolutions, c. 1789-1830
Latin American Revolutions, c. 1789-1830Latin American Revolutions, c. 1789-1830
Latin American Revolutions, c. 1789-1830
 
Prelims of Kant get Marx 2.0: a general politics quiz
Prelims of Kant get Marx 2.0: a general politics quizPrelims of Kant get Marx 2.0: a general politics quiz
Prelims of Kant get Marx 2.0: a general politics quiz
 
How to Filter Blank Lines in Odoo 17 Accounting
How to Filter Blank Lines in Odoo 17 AccountingHow to Filter Blank Lines in Odoo 17 Accounting
How to Filter Blank Lines in Odoo 17 Accounting
 
Benefits & Challenges of Inclusive Education
Benefits & Challenges of Inclusive EducationBenefits & Challenges of Inclusive Education
Benefits & Challenges of Inclusive Education
 
M-2- General Reactions of amino acids.pptx
M-2- General Reactions of amino acids.pptxM-2- General Reactions of amino acids.pptx
M-2- General Reactions of amino acids.pptx
 
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRADUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
 
The Singapore Teaching Practice document
The Singapore Teaching Practice documentThe Singapore Teaching Practice document
The Singapore Teaching Practice document
 
Human-AI Co-Creation of Worked Examples for Programming Classes
Human-AI Co-Creation of Worked Examples for Programming ClassesHuman-AI Co-Creation of Worked Examples for Programming Classes
Human-AI Co-Creation of Worked Examples for Programming Classes
 
Practical Research 1: Lesson 8 Writing the Thesis Statement.pptx
Practical Research 1: Lesson 8 Writing the Thesis Statement.pptxPractical Research 1: Lesson 8 Writing the Thesis Statement.pptx
Practical Research 1: Lesson 8 Writing the Thesis Statement.pptx
 

Cluster Analysis for Dummies

  • 1. Data Analysis Course Cluster Analysis Venkat Reddy
  • 2. Contents
      • Why segmentation is needed
      • Introduction to segmentation and cluster analysis
      • Applications of cluster analysis
      • Types of clusters
      • K-means clustering
  • 3. Why is segmentation needed? Problem:
      • There are 10,000 customers, and we know each one's age, city, income, employment status, and designation.
      • You have to sell 100 BlackBerry phones (each costing $1,000) to people in this group, and you have at most 7 days.
      • If you give a demo to every individual, 10,000 demos will take more than a year. How do you sell the maximum number of phones while giving the minimum number of demos?
  • 4. Why is segmentation needed? Solution:
      • Divide the whole population into two groups: employed and unemployed.
      • Divide the employed group into high-salary and low-salary groups.
      • Divide the high-salary group into high-designation and low-designation groups.
    Resulting segments:
      10,000 customers
        Unemployed: 3,000
        Employed: 7,000
          Low salary: 5,000
          High salary: 2,000
            Low designation: 1,800
            High designation: 200
  • 5. Segmentation and Cluster Analysis
      • A cluster is a group of similar objects (cases, points, observations, examples, members, customers, patients, locations, etc.).
      • Cluster analysis finds groups of cases, observations, or objects in the population such that the objects are:
          • Homogeneous within a group (high intra-class similarity)
          • Heterogeneous between groups (low inter-class similarity)
      • In other words, intra-cluster distances are minimized and inter-cluster distances are maximized.
  • 6. Applications of Cluster Analysis
      • Market segmentation: grouping people (with the willingness, purchasing power, and authority to buy) according to their similarity in several dimensions related to a product under consideration.
      • Sales segmentation: clustering can tell you what types of customers buy what products.
      • Credit risk: segmentation of customers based on their credit history.
      • Operations: high-performer segmentation and promotions based on a person's performance.
      • Insurance: identifying groups of motor insurance policy holders with a high average claim cost.
      • City planning: identifying groups of houses according to their house type, value, and geographical location.
      • Geographical: identification of areas of similar land use in an earth observation database.
  • 7. Types of Clusters
      • Partitional (non-hierarchical) clustering: a division of objects into non-overlapping subsets (clusters) such that each object is in exactly one cluster.
          • Non-hierarchical methods divide a dataset of N objects into M clusters.
          • K-means clustering, a non-hierarchical technique, is the most commonly used one in business analytics.
      • Hierarchical clustering: a set of nested clusters organized as a hierarchical tree.
          • Hierarchical methods produce a set of nested clusters in which each pair of objects or clusters is progressively nested in a larger cluster until only one cluster remains.
          • The CHAID tree is the most widely used in business analytics.
  • 8. Cluster Analysis: Example
    Scores of nine students:
                   Maths  Science   Gk  Apt
      Student-1       94       82   87   89
      Student-2       46       67   33   72
      Student-3       98       97   93  100
      Student-4       14        5    7   24
      Student-5       86       97   95   95
      Student-6       34       32   75   66
      Student-7       69       44   59   55
      Student-8       85       90   96   89
      Student-9       24       26   15   22
    The same students sorted by Maths score:
                   Maths  Science   Gk  Apt
      Student-4       14        5    7   24
      Student-9       24       26   15   22
      Student-6       34       32   75   66
      Student-2       46       67   33   72
      Student-7       69       44   59   55
      Student-8       85       90   96   89
      Student-5       86       97   95   95
      Student-1       94       82   87   89
      Student-3       98       97   93  100
    Resulting groups: students 4, 9, 6; students 2, 7; students 8, 5, 1, 3.
  • 9. Building Clusters
      • The aim is to build clusters, i.e. to divide the whole population into groups of similar objects.
      • What is similarity or dissimilarity?
      • How do you define the distance between two clusters?
    Steps:
      1. Select a distance measure.
      2. Select a clustering algorithm.
      3. Define the distance between two clusters.
      4. Determine the number of clusters.
      5. Validate the analysis.
  • 10. Dissimilarity & Similarity
    Which two customers are similar?
               Weight
      Cust1        68
      Cust2        72
      Cust3       100
    Which two customers are similar now?
               Weight  Age
      Cust1        68   25
      Cust2        72   70
      Cust3       100   28
    Which two customers are similar in this case?
               Weight  Age  Income
      Cust1        68   25  60,000
      Cust2        72   70   9,000
      Cust3       100   28  62,000
  • 11. Quantifying Dissimilarity: Distance Measures
      • To measure similarity between two observations, a distance measure is needed. With a single variable, similarity is straightforward.
          • Example: income. Two individuals are similar if their income levels are similar, and the dissimilarity increases as the income gap increases.
      • Multiple variables require an aggregate distance measure.
          • With many characteristics (e.g. income, age, consumption habits, family composition, owning a car, education level, job), it becomes more difficult to define similarity with a single value.
      • The best-known distance measure is the Euclidean distance, which is the concept we use in everyday life for spatial coordinates.
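  • 12. Examples of distances
      • Euclidean distance: $D_{ij} = \sqrt{\sum_{k=1}^{n} (x_{ki} - x_{kj})^2}$
      • City-block (Manhattan) distance: $D_{ij} = \sum_{k=1}^{n} |x_{ki} - x_{kj}|$
      • $D_{ij}$ is the distance between cases $i$ and $j$; $x_{kj}$ is the value of variable $x_k$ for case $j$.
      • Other distance measures: Chebyshev, Minkowski, Mahalanobis, maximum distance, cosine similarity, simple correlation between observations, etc.
      • Data matrix (one row per case, one column per variable; here the first subscript is the case):
        $\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$
      • Dissimilarity matrix:
        $\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$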
  • 13. Calculating the distance
    One variable:
               Weight
      Cust1        68
      Cust2        72
      Cust3       100
      • Cust1 vs Cust2: |68 - 72| = 4
      • Cust2 vs Cust3: |72 - 100| = 28
      • Cust3 vs Cust1: |100 - 68| = 32
    Two variables:
               Weight  Age
      Cust1        68   25
      Cust2        72   70
      Cust3       100   28
      • Cust1 vs Cust2: sqrt((68-72)^2 + (25-70)^2) = 45.18
      • Cust2 vs Cust3: sqrt((72-100)^2 + (70-28)^2) = 50.48
      • Cust3 vs Cust1: sqrt((100-68)^2 + (28-25)^2) = 32.14
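The two-variable distances on this slide can be checked with a short script. This is a minimal sketch in plain Python (not the SAS used in the course demos); the dictionary `customers` and the helper names `euclidean` and `manhattan` are illustrative choices, not part of the original deck.

```python
import math

# Customer data from the slide: (weight, age).
customers = {
    "Cust1": (68, 25),
    "Cust2": (72, 70),
    "Cust3": (100, 28),
}

def euclidean(a, b):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

pairs = [("Cust1", "Cust2"), ("Cust2", "Cust3"), ("Cust3", "Cust1")]
for i, j in pairs:
    e = euclidean(customers[i], customers[j])
    m = manhattan(customers[i], customers[j])
    print(f"{i} vs {j}: Euclidean = {e:.2f}, Manhattan = {m}")
```

On weight alone, Cust1 and Cust2 are the closest pair; once age is included, Cust3 and Cust1 become the closest pair.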
  • 14. Demo: Calculation of distance
      proc distance data=cust_data out=Dist method=Euclid nostd;
        var interval(Credit_score Expenses);
      run;
      proc print data=Dist;
      run;
  • 15. Lab: Distance Calculation
      proc distance data=cust_data out=Count_Dist method=Euclid nostd;
        var interval(Area_Sq_Miles_ GDP_MM_ Unemp_rate);
      run;
      proc print data=Count_Dist;
      run;
  • 16. Clustering algorithms
      • K-means clustering
      • Fuzzy c-means clustering
      • Hierarchical clustering
      • Gaussian (EM) clustering
      • Quality Threshold (QT) clustering
      • MST-based clustering
      • Density-based clustering
      • Kernel k-means clustering
  • 17. K-Means Clustering: Algorithm
      1. The number of clusters, k, is fixed.
      2. An initial set of k "seeds" (aggregation centres) is provided, either:
          • the first k elements, or
          • other seeds (randomly selected or explicitly defined).
      3. Given a certain fixed threshold, all units are assigned to the nearest cluster seed.
      4. New seeds are computed.
      5. Go back to step 3 until no reclassification is necessary.
    Or simply:
      Initialize k cluster centers
      Do
        Assignment step: assign each data point to its closest cluster center
        Re-estimation step: re-compute the cluster centers
      While (there are still changes in the cluster centers)
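The assignment and re-estimation loop above can be sketched in a few lines. This is a minimal illustration in plain Python (not SAS `proc fastclus`), under stated assumptions: it seeds with the first k observations, uses Euclidean distance, and stops when no point changes cluster. The function name `k_means` and the toy `points` are hypothetical.

```python
import math

def k_means(points, k, max_iter=20):
    """Return (labels, centers) for a list of numeric tuples."""
    # Seeds: the first k observations (assumption for this sketch).
    centers = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no reclassification was necessary
        labels = new_labels
        # Re-estimation step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers

# Toy data: two well-separated groups.
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
labels, centers = k_means(points, k=2)
print(labels)
print(centers)
```

Because the result depends on the initial seeds, real analyses usually try several initializations; this sketch keeps the first-k rule only to stay deterministic.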
  • 31. K-Means clustering: continue until there is no significant change between two iterations.
  • 32. K-Means clustering in action
      • Dividing the data into 10 clusters using K-means.
      • Figure annotation: the distance metric will decide the cluster for these points.
  • 33. K-Means Clustering SAS Demo
      • A supermarket wants to send promotional coupons to 100 families.
      • The idea is to identify 100 customers with medium income and low recent spend.
      proc fastclus data=sup_market radius=0 replace=full maxclusters=5 maxiter=20 distance out=clustr_out;
        id cust_id;
        var age family_size income spend visit_Other_shops;
      run;
  • 34. Lab: K-Means Clustering
      • Download the contact center agents data.
      • The performance data contains:
          • Average handling time
          • Average number of calls
          • CSAT
          • Resolution score
      • Identify the top 10 agents for promotion based on the criteria below:
          • High CSAT
          • High resolution score
          • Low average handling time
          • High number of calls
  • 35. SAS Code Options
      • The RADIUS= option establishes the minimum distance criterion for selecting new seeds. No observation is considered as a new seed unless its minimum distance to previous seeds exceeds the value given by the RADIUS= option. The default value is 0.
      • The MAXCLUSTERS= option specifies the maximum number of clusters allowed. If you omit the MAXCLUSTERS= option, a value of 100 is assumed.
      • The REPLACE= option specifies how seed replacement is performed:
          • FULL: requests default seed replacement.
          • PART: requests seed replacement only when the distance between the observation and the closest seed is greater than the minimum distance between seeds.
          • NONE: suppresses seed replacement.
          • RANDOM: selects a simple pseudo-random sample of complete observations as initial cluster seeds.
  • 36. SAS Code & Options
      • The MAXITER= option specifies the maximum number of iterations for recomputing cluster seeds. When the value of the MAXITER= option is greater than 0, each observation is assigned to the nearest seed, and the seeds are recomputed as the means of the clusters.
      • The LIST option lists all observations, giving the value of the ID variable (if any), the number of the cluster to which the observation is assigned, and the distance between the observation and the final cluster seed.
      • The DISTANCE option computes distances between the cluster means.
      • The ID variable, which can be character or numeric, identifies observations on the output when you specify the LIST option.
      • The VAR statement lists the numeric variables to be used in the cluster analysis. If you omit the VAR statement, all numeric variables not listed in other statements are used.
  • 37. Distance between Clusters
    Here $t_{ip}$ is an element of cluster $K_i$, $t_{jq}$ is an element of cluster $K_j$, and $d$ is the distance between two elements.
      • Single link: smallest distance between an element in one cluster and an element in the other, i.e. $\mathrm{dist}(K_i, K_j) = \min_{p,q} d(t_{ip}, t_{jq})$
      • Complete link: largest distance between an element in one cluster and an element in the other, i.e. $\mathrm{dist}(K_i, K_j) = \max_{p,q} d(t_{ip}, t_{jq})$
      • Average: average distance between an element in one cluster and an element in the other, i.e. $\mathrm{dist}(K_i, K_j) = \mathrm{avg}_{p,q}\, d(t_{ip}, t_{jq})$
      • Centroid: distance between the centroids of two clusters, i.e. $\mathrm{dist}(K_i, K_j) = d(C_i, C_j)$
      • Medoid: distance between the medoids of two clusters, i.e. $\mathrm{dist}(K_i, K_j) = d(M_i, M_j)$. A medoid is a chosen, centrally located object in the cluster.
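The linkage definitions above differ only in how the pairwise distances are summarized. The following is a minimal Python sketch, assuming Euclidean distance between elements; the function names and the two toy clusters are illustrative and not taken from the deck.

```python
import math
from itertools import product

def pairwise(cluster_a, cluster_b):
    """All distances between an element of A and an element of B."""
    return [math.dist(a, b) for a, b in product(cluster_a, cluster_b)]

def single_link(cluster_a, cluster_b):
    return min(pairwise(cluster_a, cluster_b))

def complete_link(cluster_a, cluster_b):
    return max(pairwise(cluster_a, cluster_b))

def average_link(cluster_a, cluster_b):
    d = pairwise(cluster_a, cluster_b)
    return sum(d) / len(d)

def centroid(cluster):
    """Coordinate-wise mean of the cluster's elements."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def centroid_link(cluster_a, cluster_b):
    return math.dist(centroid(cluster_a), centroid(cluster_b))

# Two toy clusters (hypothetical data).
A = [(0, 0), (0, 1)]
B = [(3, 0), (4, 0)]
for name, fn in [("single", single_link), ("complete", complete_link),
                 ("average", average_link), ("centroid", centroid_link)]:
    print(f"{name}: {fn(A, B):.3f}")
```

Single link is never larger than average link, which is never larger than complete link, since they are the minimum, mean, and maximum of the same set of pairwise distances.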
  • 38. SAS output interpretation
      • RMSSTD: pooled standard deviation of all the variables forming the cluster (variance within a cluster). Since the objective of cluster analysis is to form homogeneous groups, the RMSSTD of a cluster should be as small as possible.
      • SPRSQ: semipartial R-squared measures the loss of homogeneity due to combining two groups or clusters to form a new group or cluster (the error incurred by combining two groups). The SPRSQ value should therefore be small, implying that we are merging two homogeneous groups.
  • 39. SAS output interpretation
      • RSQ (R-squared) measures the extent to which groups or clusters are different from each other (variance between the clusters). When you have just one cluster, the RSQ value is, intuitively, zero. The RSQ value should therefore be high.
      • Centroid Distance is simply the Euclidean distance between the centroids of the two clusters that are to be joined or merged. It is therefore a measure of the homogeneity of merged clusters, and its value should be small.
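To see why R-squared is zero for a single cluster and high for well-separated clusters, it can be computed as one minus the ratio of within-cluster to total sum of squares. This is a minimal Python sketch of that standard definition, not a reproduction of the SAS output; `r_squared` and the toy data are illustrative names and values.

```python
def sum_of_squares(points):
    """Sum of squared deviations of the points from their own mean."""
    mean = [sum(col) / len(points) for col in zip(*points)]
    return sum((x - m) ** 2 for p in points for x, m in zip(p, mean))

def r_squared(points, labels):
    """1 - (within-cluster SS / total SS)."""
    total = sum_of_squares(points)
    within = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        within += sum_of_squares(members)
    return 1 - within / total

# Toy data: two well-separated groups (hypothetical).
points = [(1, 1), (1, 2), (2, 1), (8, 9), (9, 8), (9, 9)]
one_cluster = [0, 0, 0, 0, 0, 0]
two_clusters = [0, 0, 0, 1, 1, 1]

print(r_squared(points, one_cluster))   # a single cluster explains nothing
print(r_squared(points, two_clusters))  # close to 1 for well-separated groups
```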
  • 40. Distance Calculation on standardized data
    Raw data:
                 Weight  Income
      Cust1          68  60,000
      Cust2          72   9,000
      Cust3         100  62,000
      Average        80  43,667
      Stdev          14  24,527
    Standardized data:
                 Weight  Income
      Cust1       -0.84    0.67
      Cust2       -0.56   -1.41
      Cust3        1.40    0.75
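The standardized values in this table can be reproduced by subtracting each column's mean and dividing by its standard deviation. This is a minimal Python sketch; it assumes the population standard deviation (dividing by n), which is what matches the slide's figures, and the helper names `standardize` and `closest_pair` are illustrative.

```python
import math
import statistics

weights = [68, 72, 100]
incomes = [60000, 9000, 62000]

def standardize(values):
    """z-score each value using the population standard deviation (assumption)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def closest_pair(rows):
    """Indices (i, j) of the two rows with the smallest Euclidean distance."""
    best = None
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            d = math.dist(rows[i], rows[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    return best[1], best[2]

z_weight = standardize(weights)
z_income = standardize(incomes)

raw_rows = list(zip(weights, incomes))
std_rows = list(zip(z_weight, z_income))

print([round(z, 2) for z in z_weight])
print([round(z, 2) for z in z_income])
print("closest on raw data:", closest_pair(raw_rows))
print("closest on standardized data:", closest_pair(std_rows))
```

On the raw data, income dominates the distance because of its much larger scale; after standardization, both variables contribute comparably, and the closest pair can change.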