Determining the number of clusters in a data set

Determining the number of clusters in a data set, a quantity often labelled k as in the k-means algorithm, is a frequent problem in data clustering, and is a distinct issue from the process of actually solving the clustering problem.


Comment: Determining the number of clusters in a data set, a quantity often labelled k as in the k-means algorithm, is a frequent problem in data clustering, and is a distinct issue from the process of actually solving the clustering problem.
Has abstract: Determining the number of clusters in a data set, a quantity often labelled k as in the k-means algorithm, is a frequent problem in data clustering, and is a distinct issue from the process of actually solving the clustering problem. For a certain class of clustering algorithms (in particular k-means, k-medoids and the expectation–maximization algorithm), there is a parameter commonly referred to as k that specifies the number of clusters to detect. Other algorithms, such as DBSCAN and the OPTICS algorithm, do not require this parameter to be specified; hierarchical clustering avoids the problem altogether. The correct choice of k is often ambiguous, with interpretations depending on the shape and scale of the distribution of points in a data set and on the clustering resolution the user wants. In addition, increasing k without penalty will always reduce the amount of error in the resulting clustering, down to the extreme case of zero error when each data point is considered its own cluster (i.e., when k equals the number of data points, n). Intuitively, then, the optimal choice of k strikes a balance between maximum compression of the data using a single cluster and maximum accuracy from assigning each data point to its own cluster. If an appropriate value of k is not apparent from prior knowledge of the properties of the data set, it must be chosen by some other means. There are several categories of methods for making this decision.
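The claim that the error shrinks to zero as k approaches n can be checked on a small example. The sketch below is illustrative only and is not taken from the article: it assumes one-dimensional data, takes "error" to mean the within-cluster sum of squared deviations, and uses exact dynamic programming over the sorted points instead of the k-means heuristic, so that the best achievable error for each k is known. The names optimal_sse_by_k and segment_sse are hypothetical.

```python
def segment_sse(xs, i, j):
    """Sum of squared deviations of xs[i:j] from its mean (requires j > i)."""
    seg = xs[i:j]
    mean = sum(seg) / len(seg)
    return sum((x - mean) ** 2 for x in seg)


def optimal_sse_by_k(points):
    """Return the minimal within-cluster error for k = 1, 2, ..., n.

    Assumption: 1-D data, where an optimal clustering always consists of
    contiguous runs of the sorted points, so a dynamic program is exact.
    """
    xs = sorted(points)
    n = len(xs)
    inf = float("inf")
    # best[j] = minimal error when the first j sorted points are split
    # into the current number of clusters (0 clusters cover 0 points).
    best = [0.0] + [inf] * n
    errors = []
    for k in range(1, n + 1):
        new = [inf] * (n + 1)
        for j in range(k, n + 1):
            # The last cluster is xs[i:j]; the first i points use k - 1 clusters.
            new[j] = min(best[i] + segment_sse(xs, i, j) for i in range(k - 1, j))
        best = new
        errors.append(best[n])
    return errors


if __name__ == "__main__":
    data = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2]
    for k, err in enumerate(optimal_sse_by_k(data), start=1):
        print(f"k={k}: error={err:.4f}")
```

On data with a few well-separated groups, the printed error drops sharply until k matches the number of groups and then shrinks slowly to zero at k = n. This is why the raw error cannot select k by itself and why the methods referred to above add a penalty or look for a bend in the curve.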
Hypernym: Problem
Is primary topic of: Determining the number of clusters in a data set
Label: Determining the number of clusters in a data set
Link from a Wikipage to an external page: stackoverflow.com/a/15376462/1036500; hal.archives-ouvertes.fr/hal-02124947/document; www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/
Link from a Wikipage to another Wikipage: Akaike information criterion; Asymptotic analysis; Bayesian information criterion; Category:Articles with example pseudocode; Category:Cluster analysis; Category:Clustering criteria; Clustering algorithm; Covariance; Cross-validation (statistics); Data clustering; Data mining; Data set; DBSCAN; Deviance information criterion; Dimensionality; Dot product; Elbow method (clustering); Empirical; Expectation–maximization algorithm; Explained variance; Feature space; File:DataClustering ElbowCriterion.JPG; F-test; Gaussian mixture model; Genetic algorithms; Hierarchical clustering; Information theory; K-means algorithm; K-means clustering; K-medoid; Least squares; Likelihood function; Limit (mathematics); Mahalanobis distance; Matrix (mathematics); Non-parametric statistics; Normal distribution; OPTICS algorithm; R (programming language); Radial basis function; Ralf Wagner; Random variable; Rate distortion theory; Robert L. Thorndike; Robert Tibshirani; R-Project; Silhouette (clustering); StackOverflow; Trevor Hastie
SameAs: 4isSH; Determining the number of clusters in a data set; m.05syt6y; Q5265701; تعیین تعداد خوشه‌ها در یک مجموعه داده
Subject: Category:Articles with example pseudocode; Category:Cluster analysis; Category:Clustering criteria
WasDerivedFrom: Determining the number of clusters in a data set?oldid=1120852726&ns=0
WikiPageLength: 19712
Wikipage page ID: 22324566
Wikipage revision ID: 1120852726
WikiPageUsesTemplate: Template:Main; Template:Math; Template:Mvar; Template:Reflist
