Algorithm Overview - K-Means

Algorithm Description 

The K-Means algorithm performs "unsupervised" clustering of a dataset.  The term "unsupervised" means that there is no target variable to guide the analysis; only predictor (or independent) variables are used.  The algorithm groups records into the same cluster when they are similar (relative to the variables involved) and into different clusters when they are not.  Normally, the similarity of records is determined by the numerical differences among the variables, usually computed as Euclidean distances.  LityxIQ uses a modified version of the classical K-Means algorithm that allows categorical variables (such as State or Gender) to play a role in the clustering as well.
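As a concrete illustration (not LityxIQ's internal code), the Euclidean distance between two records can be computed as follows; the variable values here are made up for the example:

```python
import numpy as np

# Two records described by the same three numeric variables
# (illustrative values: age, income, number of purchases).
record_a = np.array([25.0, 60000.0, 3.0])
record_b = np.array([30.0, 58000.0, 1.0])

# Euclidean distance: the square root of the sum of squared differences.
distance = np.sqrt(np.sum((record_a - record_b) ** 2))
```

Note that in practice numeric variables are often standardized first, so that variables on large scales (like income above) do not dominate the distance.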

In the following picture, each mark represents a data point with two variables.  The resulting clustering shows three clusters in different colors.  The algorithm assigned one of three colors to each data point such that points of the same color are generally close to each other, while points of different colors are generally farther apart.



Generally, the way the algorithm progresses is as follows:

  1. The number of clusters to find is given.
  2. A random record from the dataset is assigned as the "starting point" for each cluster.
  3. The "center" of each cluster is determined.
  4. For every data point, determine which cluster center it is closest to and assign the point to that cluster.
  5. Re-compute the cluster centers.
  6. If the cluster centers did not change "too much" from the last time they were computed, then we are done.  If they did change, go back to Steps 4 and 5.  Repeat until the centers have changed very little and have become stable.
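The steps above can be sketched in simplified form.  This is illustrative NumPy code, not LityxIQ's implementation; it uses plain Euclidean distance and omits details such as handling clusters that become empty:

```python
import numpy as np

def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal sketch of the classical K-Means loop described above."""
    rng = np.random.default_rng(seed)
    # Steps 1-3: pick k random records as the starting cluster centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 4: assign each point to its nearest center (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 5: recompute each center as the mean of its assigned points.
        new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop once the centers have stabilized.
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return labels, centers
```

Running this on data with two well-separated groups of points and k=2 recovers the two groups as the two clusters.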


Additional Links


LityxIQ Parameters  


Minimum/Maximum Number of Clusters - Enter the minimum and maximum number of clusters that the algorithm should attempt to find.  Based on the Criterion used (see below), LityxIQ will decide which number of clusters in that range is optimal and then build the model for that number of clusters.  Note that if you want to ensure a specific number of clusters is built, say 5, you can set the minimum and maximum to that same value.

Criterion - The metric used to determine which cluster size is optimal when the minimum and maximum numbers of clusters differ.  

  • Within Cluster SS - This criterion adds up the within-cluster sum of squared distances across all clusters.  It looks to minimize the differences between records placed in the same cluster, while accounting for the number of clusters being created via the "elbow" method.  A smaller value is a sign of a better clustering.
  • Silhouette - The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.  Larger values are a sign of more optimal clustering.
  • McClain - The McClain index is defined as the ratio of the mean within-cluster distance to the mean between-cluster distance.  Smaller values are a sign of more optimal clustering.
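To make the first criterion concrete, here is an illustrative computation of the within-cluster sum of squares for a given clustering.  This is a sketch of the standard definition, not LityxIQ's code:

```python
import numpy as np

def within_cluster_ss(points, labels):
    """Within-cluster sum of squares: for each cluster, add up the
    squared Euclidean distances from each member to the cluster mean."""
    total = 0.0
    for j in np.unique(labels):
        members = points[labels == j]
        center = members.mean(axis=0)
        total += np.sum((members - center) ** 2)
    return total
```

For two tight groups of points, the "correct" two-cluster labeling yields a much smaller value than lumping everything into one cluster, which is why smaller values signal a better clustering.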

Lambda - Lambda controls how heavily categorical variables are weighted in the clustering, relative to numeric variables.  Increasing Lambda causes categorical variables to be more heavily weighted.
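LityxIQ does not document its exact distance formula here, but a common way to mix numeric and categorical variables (used, for example, by the k-prototypes algorithm) adds a Lambda-weighted count of categorical mismatches to the numeric distance.  This sketch illustrates the role Lambda plays; the function and its form are assumptions for illustration only:

```python
import numpy as np

def mixed_distance(num_a, num_b, cat_a, cat_b, lam):
    """Illustrative mixed-type distance in the spirit of k-prototypes:
    squared Euclidean distance on the numeric variables, plus lambda
    times the number of categorical mismatches."""
    numeric_part = np.sum((np.asarray(num_a) - np.asarray(num_b)) ** 2)
    mismatches = sum(a != b for a, b in zip(cat_a, cat_b))
    return numeric_part + lam * mismatches
```

With a larger Lambda, the same categorical disagreement (e.g., differing State values) contributes more to the distance, so it has more influence on which cluster a record joins.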

Maximum Iterations - The maximum number of iterations the K-Means algorithm will run while searching for an improved clustering if it has not yet converged.  Setting this higher may lead to an improved clustering at the cost of longer run times. 

Starting Points - The number of random starting points the algorithm will use to try to create a strong clustering result.  Increasing this value increases the number of different random initial guesses, but also increases run times.  The algorithm proceeds with the best result among the initializations.
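The effect of multiple starting points can be sketched as follows: run the algorithm from several random initializations and keep the run with the lowest within-cluster sum of squares.  This is illustrative code, not LityxIQ's implementation, and it omits details such as handling clusters that become empty:

```python
import numpy as np

def one_kmeans_run(points, k, rng, max_iter=50):
    """Single simplified K-Means run from one random initialization."""
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Score this run by its within-cluster sum of squares (smaller is better).
    wcss = sum(np.sum((points[labels == j] - centers[j]) ** 2) for j in range(k))
    return labels, wcss

def best_of_n_starts(points, k, n_starts, seed=0):
    """Try several random starting points; keep the run with the lowest
    within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    runs = [one_kmeans_run(points, k, rng) for _ in range(n_starts)]
    return min(runs, key=lambda run: run[1])
```

Because each run can land in a different local optimum, trying more starting points raises the chance of finding a strong clustering, at the cost of proportionally longer run times.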