Determining the number of clusters in a data set

Determining the number of clusters in a data set, a quantity often labelled k as in the k-means algorithm, is a frequent problem in data clustering, and is a distinct issue from the process of actually solving the clustering problem.

Comment
Determining the number of clusters in a data set, a quantity often labelled k as in the k-means algorithm, is a frequent problem in data clustering, and is a distinct issue from the process of actually solving the clustering problem.
Depiction
DataClustering ElbowCriterion.jpg
Has abstract
Determining the number of clusters in a data set, a quantity often labelled k as in the k-means algorithm, is a frequent problem in data clustering, and is a distinct issue from the process of actually solving the clustering problem. For a certain class of clustering algorithms (in particular k-means, k-medoids and the expectation–maximization algorithm), there is a parameter commonly referred to as k that specifies the number of clusters to detect. Other algorithms, such as DBSCAN and the OPTICS algorithm, do not require this parameter to be specified; hierarchical clustering avoids the problem altogether. The correct choice of k is often ambiguous, with interpretations depending on the shape and scale of the distribution of points in a data set and on the clustering resolution the user wants. In addition, increasing k without penalty will always reduce the amount of error in the resulting clustering, down to the extreme case of zero error when each data point is considered its own cluster (i.e., when k equals the number of data points, n). Intuitively, then, the optimal choice of k strikes a balance between maximum compression of the data using a single cluster and maximum accuracy from assigning each data point to its own cluster. If an appropriate value of k is not apparent from prior knowledge of the properties of the data set, it must be chosen by some other means. There are several categories of methods for making this decision.
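The trade-off described in the abstract (error shrinking as k grows, reaching zero at k = n) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the article: the toy data set of three small blobs, the helper names `sq_dist`, `farthest_first` and `kmeans_sse`, the use of a plain Lloyd-style k-means, and the deterministic farthest-first seeding (chosen only so the run is reproducible).

```python
# Illustrative sketch only: error (sum of squared distances to the nearest
# cluster centre) as a function of k, on a hypothetical toy data set.

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the
    point that is farthest from all centres chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    return centers

def kmeans_sse(points, k, max_iter=100):
    """Run Lloyd-style k-means and return the final sum of squared errors."""
    centers = farthest_first(points, k)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centre.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: each centre moves to the mean of its members
        # (an empty cluster keeps its previous centre).
        updated = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centers:
            break
        centers = updated
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

# Hypothetical data: three tight, well-separated blobs of five points each.
blob_centers = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
offsets = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5), (0.0, 0.0)]
points = [(cx + dx, cy + dy) for cx, cy in blob_centers for dx, dy in offsets]

sse_by_k = {k: kmeans_sse(points, k) for k in range(1, len(points) + 1)}

for k in (1, 2, 3, 4, len(points)):
    print(f"k={k:2d}  SSE={sse_by_k[k]:.3f}")
```

On this toy data the error drops sharply up to k = 3, where only the small within-blob spread remains, and reaches exactly zero at k = n. The error alone therefore cannot select k, which is why the methods the abstract alludes to add a penalty or look for a point of diminishing returns.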
Hypernym
Problem
Is primary topic of
Determining the number of clusters in a data set
Label
Determining the number of clusters in a data set
Link from a Wikipage to an external page
stackoverflow.com/a/15376462/1036500
hal.archives-ouvertes.fr/hal-02124947/document
www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/
Link from a Wikipage to another Wikipage
Akaike information criterion
Asymptotic analysis
Bayesian information criterion
Category:Articles with example pseudocode
Category:Cluster analysis
Category:Clustering criteria
Clustering algorithm
Covariance
Cross-validation (statistics)
Data clustering
Data mining
Data set
DBSCAN
Deviance information criterion
Dimensionality
Dot product
Elbow method (clustering)
Empirical
Expectation–maximization algorithm
Explained variance
Feature space
File:DataClustering ElbowCriterion.JPG
F-test
Gaussian mixture model
Genetic algorithms
Hierarchical clustering
Information theory
K-means algorithm
K-means clustering
K-medoid
Least squares
Likelihood function
Limit (mathematics)
Mahalanobis distance
Matrix (mathematics)
Non-parametric statistics
Normal distribution
OPTICS algorithm
R (programming language)
Radial basis function
Ralf Wagner
Random variable
Rate distortion theory
Robert L. Thorndike
Robert Tibshirani
R-Project
Silhouette (clustering)
StackOverflow
Trevor Hastie
SameAs
4isSH
Determining the number of clusters in a data set
m.05syt6y
Q5265701
تعیین تعداد خوشه‌ها در یک مجموعه داده
Subject
Category:Articles with example pseudocode
Category:Cluster analysis
Category:Clustering criteria
Thumbnail
DataClustering ElbowCriterion.jpg?width=300
WasDerivedFrom
Determining the number of clusters in a data set?oldid=1120852726&ns=0
WikiPageLength
19712
Wikipage page ID
22324566
Wikipage revision ID
1120852726
WikiPageUsesTemplate
Template:Main
Template:Math
Template:Mvar
Template:Reflist