
Determining the number of clusters in a data set
Determining the number of clusters in a data set, a quantity often labelled k as in the k-means algorithm, is a frequent problem in data clustering, and is a distinct issue from the process of actually solving the clustering problem.
- Comment
- en: Determining the number of clusters in a data set, a quantity often labelled k as in the k-means algorithm, is a frequent problem in data clustering, and is a distinct issue from the process of actually solving the clustering problem.
- Has abstract
- en: Determining the number of clusters in a data set, a quantity often labelled k as in the k-means algorithm, is a frequent problem in data clustering, and is a distinct issue from the process of actually solving the clustering problem. For a certain class of clustering algorithms (in particular k-means, k-medoids and the expectation–maximization algorithm), a parameter commonly referred to as k specifies the number of clusters to detect. Other algorithms, such as DBSCAN and OPTICS, do not require this parameter; hierarchical clustering avoids the problem altogether. The correct choice of k is often ambiguous, with interpretations depending on the shape and scale of the distribution of points in the data set and on the clustering resolution the user wants. In addition, increasing k without penalty will always reduce the error of the best resulting clustering (or at least never increase it), down to the extreme case of zero error when each data point is its own cluster (i.e., when k equals the number of data points, n). Intuitively, then, the optimal choice of k strikes a balance between maximum compression of the data using a single cluster and maximum accuracy by assigning each data point to its own cluster. If an appropriate value of k is not apparent from prior knowledge of the properties of the data set, it must be chosen by some other means; there are several categories of methods for making this decision.
- Hypernym
- Problem
- Is primary topic of
- Determining the number of clusters in a data set
- Label
- en: Determining the number of clusters in a data set
- Link from a Wikipage to an external page
- stackoverflow.com/a/15376462/1036500
- hal.archives-ouvertes.fr/hal-02124947/document
- www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/
- Link from a Wikipage to another Wikipage
- Akaike information criterion
- Asymptotic analysis
- Bayesian information criterion
- Category:Articles with example pseudocode
- Category:Cluster analysis
- Category:Clustering criteria
- Clustering algorithm
- Covariance
- Cross-validation (statistics)
- Data clustering
- Data mining
- Data set
- DBSCAN
- Deviance information criterion
- Dimensionality
- Dot product
- Elbow method (clustering)
- Empirical
- Expectation–maximization algorithm
- Explained variance
- Feature space
- File:DataClustering ElbowCriterion.JPG
- F-test
- Gaussian mixture model
- Genetic algorithms
- Hierarchical clustering
- Information theory
- K-means algorithm
- K-means clustering
- K-medoid
- Least squares
- Likelihood function
- Limit (mathematics)
- Mahalanobis distance
- Matrix (mathematics)
- Non-parametric statistics
- Normal distribution
- OPTICS algorithm
- R (programming language)
- Radial basis function
- Ralf Wagner
- Random variable
- Rate distortion theory
- Robert L. Thorndike
- Robert Tibshirani
- R-Project
- Silhouette (clustering)
- StackOverflow
- Trevor Hastie
- SameAs
- 4isSH
- Determining the number of clusters in a data set
- m.05syt6y
- Q5265701
- تعیین تعداد خوشهها در یک مجموعه داده
- Subject
- Category:Articles with example pseudocode
- Category:Cluster analysis
- Category:Clustering criteria
- WasDerivedFrom
- Determining the number of clusters in a data set?oldid=1120852726&ns=0
- WikiPageLength
- 19712
- Wikipage page ID
- 22324566
- Wikipage revision ID
- 1120852726
- WikiPageUsesTemplate
- Template:Main
- Template:Math
- Template:Mvar
- Template:Reflist
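
The abstract's observation that the error keeps shrinking as k grows, reaching zero when k equals n, can be illustrated with a small sketch. This example is not part of the source entry and rests on several assumptions: the data are one-dimensional, the error is measured as the within-cluster sum of squared deviations (the abstract does not fix an error measure), and the best partition for each k is found exactly by dynamic programming over the sorted points instead of by running k-means itself. The names `sse` and `optimal_error_by_k` and the sample points are hypothetical.

```python
def sse(segment):
    """Sum of squared deviations of the values in `segment` from their mean."""
    mean = sum(segment) / len(segment)
    return sum((x - mean) ** 2 for x in segment)


def optimal_error_by_k(points):
    """Return the minimal total SSE for k = 1, 2, ..., n clusters (1-D data).

    In one dimension an optimal least-squares clustering consists of
    contiguous runs of the sorted points, so dynamic programming over
    split positions finds the exact optimum for every k.
    """
    xs = sorted(points)
    n = len(xs)
    inf = float("inf")
    # best[j]: minimal SSE for the first j sorted points using the current
    # number of clusters (zero clusters can only cover zero points).
    best = [0.0] + [inf] * n
    curve = []
    for k in range(1, n + 1):
        new = [inf] * (n + 1)
        for j in range(k, n + 1):
            # The last cluster is xs[i:j]; the first i points use k - 1 clusters.
            new[j] = min(best[i] + sse(xs[i:j]) for i in range(k - 1, j))
        best = new
        curve.append(best[n])
    return curve


if __name__ == "__main__":
    # Hypothetical sample with three visually obvious groups.
    sample = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8]
    for k, err in enumerate(optimal_error_by_k(sample), start=1):
        print(f"k={k}: error={err:.4f}")
```

On this sample the error falls steeply up to k = 3 and only marginally afterwards, ending at exactly zero for k = n. The curve alone never stops decreasing, which is why some penalty or heuristic is needed to balance compression against accuracy when choosing k.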