K-means

K-means is a clustering algorithm that aims to partition a set of observation points in the Euclidean space into k clusters. The clusters produced by this method has the property of each observation belonging to the cluster with the nearest mean.

  1. Input: a set of N reference points.
  2. Output: the coordinates of the centers of each of the k clusters indexed from 0 to k - 1, the membership vector of each point whose i-th component denotes the cluster index to which the point belongs to.

Note

In this section, the “mean”, “centroid”, and “center” are used integerchangeably to mean the geometric center of a set of points. For a set of points, (x_1, x_2, ..., x_N) , where each observation x_i \in R^d, the “mean” or the “cenroid” or the “center” of the set is given by: \mu = \frac{1}{N} \sum\limits_{i=1}^N x_i.

Introduction

Clustering is a form of unsupervised learning which tries to find structures in the data without using any labels or target values. Clustering partitions a set of observations into separate groupings such that an observation in a given group are more similar to another observation in the same group than to another in a different group.

Note that clustering methods require a notion of similarity. The canonical form of k-means defines the similarity measure using the Euclidean metric. For x, y, z\in R^d, x is more similar to y than z if and only if ||x - y||
\leq ||x - z||. Given a set of observations X = (x_1, x_2,
..., x_N) , where each observation x_i \in R^d, then k-means clustering aims to partition the N observations into k sets (k < N) \: S = \{S_1, S_2,...,
S_k\} with the objective to minimize the squared Euclidean distance between points in any subset and their center of mass:

\arg \min\limits_S  \sum_{i=1}^{k} \: \sum_{x_j \in S_i} ||x_j -\mu_i||^2

where \mu_i is the mean of S_i .

A simple algorithm for a fixed number of Clusters

The naive version version of the k-means algorithm (sometimes referred to as Lloyd’s algorithm) is the following: .. _algorithm:

  1. Choose k points in the geometrical space to represent the starting k cluster means. These are called the k means.
  2. Assignment Step: Assign each point in the data to the closest mean. This step involves calculating the distance between each input point in the data to each of the k means and thus has O(k \cdot N) complexity where N is the size of the input data set.
  3. Update Step: Calculate the new set of k means as that of the centre of mass of observations for their respective cluster.
  4. If the new k means are the same as the old k means i.e. nothing changed, terminate. Otherwise go to step 2.

Note

Numerical round-off errors can cause some of the k distances between a given point and the means of the k clusters to have the same value. In this case, we employ a simple tie-breaking rule to assign the point to the cluster with the lowest cluster index everytime to facilitate convergence.

Crossvalidating for the Number of Clusters

Choosing the right number of clusters is a hard problem in k-means. It is often chosen ad hoc based on prior knowledge. The following methods are the most common methods for automatically choosing the number of clusters k while clustering.

[pelleg2000extending] uses the Bayesian Information Criterion: BIC(C | X) = \mathcal{L}(X | C) - \frac{p}{2} \log N where \mathcal{L}(X | C) is the log-likelihood the dataset according to the cluster model C, p = k(d + 1) the number of parameters in the cluster model C and N is the number of points the model C. The cluster model that maximies BIC is chosen in the end.

k-means clustering assumes that each cluster comes from a unimodal distribution such as a Gaussian distribution. From this assumption, [hamerly2003learning] uses a statistical test of a Gaussian fit for a set of points to decide whether to split a given cluster into two. In this paper, the Anderson-Darling statistic is computed to perform the hypothesis testing. The following steps describe the algorithm:

  1. The algorithm produces an initial set of k_{initial} clusters, which can start from 1 or a higher value based on prior knowledge.
  2. Let k be the number of clusters. For each cluster S_k, pick two pivots \mu_1 and \mu_2 inside S_k, and project each point in S_k onto \mu_1 - \mu_2. Test whether the projected points has the univariate Gaussian distribution by using the Anderson-Darling statistic.
  3. If it does not follow the Gaussian distribution: discard the original center of S_k, \mu_k, and add \mu_1, \mu_2 to the list of centers; re-run k-means and go back to step 2. If it does follow the Gaussian distribution, then terminate.

However, note that clusters produced by practical k-means clusterings for a fixed value of k itself are only locally optimal. This is because the associated optimization problem is a mixed-integer non-convex optimization one. The procedures described above are just heuristics.

Applications of Kmeans

k-means is very popular in retail and customer segmentation [zheng-tobacco], [reynolds1999relationship].

1305 Implementation Details

The 1305 implementation of the k-means algorithms is derived from the algorithm explained in algorithm_. The technique uses kd-trees (we also support ball/metric-trees) and geometric reasoning to accelerate the assignment step (step 2) and can be used to potentially accelerate update step (step 3). We first index the data points that are to be clustered using a spatial partitioning tree.

    1. Let the hyper-rectangle H containing a subset of the data points X_S \subset X be given. We can assign X_S to the cluster C_m if \max\limits_{x \in X_S} || x - \mu_m|| < \min\limits_{x \in X_S} ||x - \mu_p|| for \forall p \not = m The cluster mean of C_m, \mu_m is said to dominate H with respect to all other cluster means. Note that any hyper-region could replace the hyper-rectangle bonding primitive, as long as there is a way to compute the minimum distance and the maximum distance between the bound and a point.
    2. If no such mean exists, we recursively verify the conditions on the subsets of X_S. When binary trees such as kd-trees are used, we test on the left and the right subsets of X_S using their bounding primitives. When a leaf node is reached, we simply assign each point in it to the closest cluster center.
    3. Note that verifying the condition above involves iterating over each data point in the hyper-rectangle, but the hyper-rectangle can be used to verify it in O(k \cdot d) time. For more details, please refer [pelleg1999accelerating].
  1. Blacklisting: In 1a. we mention that we try to find a centroid (mean) \mu_m such that it completely ‘dominates’ all other means. \mu_m as mentioned is also the mean closest to the hyper-rectangle. Yet, if \mu_m does not dominate all other means we recurse to the children of the hyper-rectangle where again the same O(k.d) computation is done. The information that \mu_m dominated some (if any but not all) of the means is lost. It can be shown that if \mu_m dominates another mean \mu_p, than \mu_m dominates all children of the hyper-rectangle owned by \mu_p as well. By passing this information down the tree we can save more computation.

Finally, if we were to store the center of mass, as cache sufficient statistics, of each hyper-rectangle, then once a dominant mean was found an iteration through all the points in this hyper-rectangle would not be necessary. A simple weighted add could be performed to facilitate the update step.

Supported Features

The current implementation implements an accelerated version of the Lloyd’s algorithm and supports the following features:

  1. The similarity measure under the weighted Euclidean metric or the Euclidean metric.
  2. Computing the k-means using the naive algorithm or the accelerated tree-version [pelleg1999accelerating].
  3. The ability to index the data points for the computation using a kd-tree or a ball-tree.
  4. The progressive computation of k-means. The k-means algorithm is an inherently iterative method, which can be made to terminate after a fixed number of iterations.

Planned Extensions

The following features are currently unsupported and will be implemented soon:

  1. Multiple restarts to avoid local minima in the produced clusters.
  2. Cross-validation.
  3. More compreshensive diagonistic outputs for the produced clusters, such as the covariance of each cluster.
_images/kmeans_colored.png

Figure 1- Visualization of the hyper-rectangles owned by centers. All points that belong to a specific center are colored the same color.

[hamerly2003learning]“Learning the k in k-means”, Hamerly, G. and Elkan, C. In Advances in Neural Information Processing Systems 2003.
[pelleg1999accelerating](1, 2) “Accelerating exact k-means algorithms with geometric reasoning”, Pelleg, D. and Moore, A. Proceedings of the fifth ACM SIGKDD 1999.
[pelleg2000extending]“X-means: Extending K-means with efficient estimation of the number of clusters”, In Proceedings of the 17th International Conf. on Machine Learning, pages 727–734. Morgan Kaufmann, San Francisco, CA, 2000.
[zheng-tobacco]“Tobacco Distribution Based on Improved K-means Algorithm”, Zheng, B. and Tang, F. and Yang, R.I, in IEEE/INFORMS 2009
[reynolds1999relationship]“A relationship customer typology”, Reynolds, K.E. and Beatty, S.E, in Journal of Retailing, vol 75, 2009

Running k-means

Available options:

--help Print this information.

--references_in REQUIRED file containing data to be clustered.

--memberships_out OPTIONAL file to store cluster memberships.

--centroids_out OPTIONAL file to store cluster means.

--k_clusters(=2) Number of clusters.

--point(=dense) Point type used by k-means. One of: dense, sparse, dense_sparse, categorical, dense_categorical

--metric(=l2) Metric function used by k-means. One of: l2, weighted_l2

--metric_weights_in A file containing weights for use with –metric=weighted_l2

--dense_sparse_scale (=1) The scaling factor for the sparse part of --point=dense_sparse or --point=dense_categoric. You can only use it with --metric=weighted_l2`

--algorithm (=tree) Algorithm used to compute clusters. One of tree or naive

--tree (=kdtree) Tree structure used by k-means. One of kdtree or balltree

--leaf_size (=20) Maximum number of points at a leaf of the tree. More saves on tree overhead but too much hurts asymptotic run-time.

--iterations (=-1) K-means can run in either batch or progressive mode. If --iterations=i is omitted, K-means finds exact clusters; otherwise, it terminates after i progressive refinements.

--log A file to receive the log, or omit for stdout.

--loglevel (=debug) Level of log detail. One of:

  • debug: log everything
  • verbose: log messages and warnings
  • warning: log only warnings
  • silent: no logging

A simple example

You must first connect to a cloud machine, for instructions see Running 1305 on the Amazon EC2 cloud. The kmeans algorithm can be run from your home directory. Run the following command on the console

kmeans --help

This will display all the options available to you and test that you are being able to run the executable. Next let us say you have a csv file which contains your data. You want to run kmeans clustering on this data. Here is the command you want to run

kmeans --references_in=samples/sample100k.txt \
       --centroids_out=centroids.txt          \
       --k_clusters=4                         \
       --memberships_out=assignments.txt

The arguments are as follows:

  1. samples/sample100k.tst is the csv which contains the data (available on the cloud).
  2. Output file centroids.txt will be created which will contain the ‘k’ (here 4) centroids.
  3. k_clusters specifies the number of clusters.
  4. Output file assignments.txt will contain the cluster assignments for each point in the input dataset.

More advanced examples

This program makes use of multidimensional trees in order to compute kmeans. The default tree is kdtree. We also support the metric-tree also known as ball-tree. Ball tree might be faster in some cases, mainly in higher dimenions. In general there is not a rule of thumb for when to use kdtree or balltree. When you have sparse data though you should always use balltree as the kdtree may be very inefficient.

// Use of kdtree
kmeans --references_in=samples/sample100k.txt \
       --centroids_out=centroids.txt          \
       --k_clusters=4                         \
       --memberships_out=assignments.txt      \
       --tree=kdtree

// Use of balltree
kmeans --references_in=samples/sample100k.txt \
       --centroids_out=centroids.txt          \
       --k_clusters=4                         \
       --memberships_out=assignments.txt      \
       --tree=balltree

If your data was sparse this is how you would run it:

// Use of balltree with sparse data
 kmeans --references_in=samples/sample100k.txt \
        --centroids_out=centroids.txt          \
        --k_clusters=4                         \
        --memberships_out=assignments.txt      \
        --tree=balltree --point=sparse

You can also run the kmeans in the naive mode (without trees) and see the difference in performance:

// Fast implementation
time kmeans --references_in=samples/sample100k.txt \
            --centroids_out=centroids.txt          \
            --k_clusters=4                         \
            --memberships_out=assignments.txt

 // Slow implementation
time kmeans --references_in=samples/sample100k.txt \
            --centroids_out=centroids.txt          \
            --k_clusters=4                         \
            --memberships_out=assignments.txt      \
            --algorithm=naive

Suggestions and bug reports

We would like to know your opinion about our product. Please send us bug reports and suggestions at support@analytics1305.com