K-means is a clustering algorithm that aims to partition a set of observation points in the Euclidean space into
clusters. The clusters produced by this method has the property of each observation belonging to the cluster with the nearest mean.
- Input: a set of
reference points.
- Output: the coordinates of the centers of each of the
clusters indexed from
to
, the membership vector of each point whose
-th component denotes the cluster index to which the point belongs to.
Note
In this section, the “mean”, “centroid”, and “center” are used integerchangeably to mean the geometric center of a set of points. For a set of points,
, where each observation
, the “mean” or the “cenroid” or the “center” of the set is given by:
.
Clustering is a form of unsupervised learning which tries to find structures in the data without using any labels or target values. Clustering partitions a set of observations into separate groupings such that an observation in a given group are more similar to another observation in the same group than to another in a different group.
Note that clustering methods require a notion of similarity. The
canonical form of
-means defines the similarity measure using
the Euclidean metric. For
,
is more
similar to
than
if and only if
. Given a set of observations
, where each observation
, then
-means clustering aims to partition the
observations into
sets
with the objective to minimize the squared Euclidean distance
between points in any subset and their center of mass:
where
is the mean of
.
The naive version version of the
-means algorithm (sometimes
referred to as Lloyd’s algorithm) is the following:
.. _algorithm:
- Choose
points in the geometrical space to represent the starting
cluster means. These are called the
means.
- Assignment Step: Assign each point in the data to the closest mean. This step involves calculating the distance between each input point in the data to each of the
means and thus has
complexity where
is the size of the input data set.
- Update Step: Calculate the new set of
means as that of the centre of mass of observations for their respective cluster.
- If the new
means are the same as the old
means i.e. nothing changed, terminate. Otherwise go to step 2.
Note
Numerical round-off errors can cause some of the
distances between a given point and the means of the
clusters to have the same value. In this case, we employ a simple tie-breaking rule to assign the point to the cluster with the lowest cluster index everytime to facilitate convergence.
Choosing the right number of clusters is a hard problem in
-means. It is often chosen ad hoc based on prior
knowledge. The following methods are the most common methods for
automatically choosing the number of clusters
while
clustering.
[pelleg2000extending] uses the Bayesian Information Criterion:
where
is the log-likelihood the dataset according
to the cluster model
,
the number of
parameters in the cluster model
and
is the number
of points the model
. The cluster model that maximies BIC is
chosen in the end.
-means clustering assumes that each cluster comes from a
unimodal distribution such as a Gaussian distribution. From this
assumption, [hamerly2003learning] uses a statistical test of a
Gaussian fit for a set of points to decide whether to split a given
cluster into two. In this paper, the Anderson-Darling statistic is
computed to perform the hypothesis testing. The following steps
describe the algorithm:
- The algorithm produces an initial set of
clusters, which can start from 1 or a higher value based on prior knowledge.
- Let
be the number of clusters. For each cluster
, pick two pivots
and
inside
, and project each point in
onto
. Test whether the projected points has the univariate Gaussian distribution by using the Anderson-Darling statistic.
- If it does not follow the Gaussian distribution: discard the original center of
,
, and add
to the list of centers; re-run
-means and go back to step 2. If it does follow the Gaussian distribution, then terminate.
However, note that clusters produced by practical k-means
clusterings for a fixed value of
itself are only locally
optimal. This is because the associated optimization problem is a
mixed-integer non-convex optimization one. The procedures described above
are just heuristics.
-means is very popular in retail and customer segmentation [zheng-tobacco], [reynolds1999relationship].
The 1305 implementation of the
-means algorithms is derived
from the algorithm explained in algorithm_. The technique uses
-trees (we also support ball/metric-trees) and geometric
reasoning to accelerate the assignment step (step 2) and can be used
to potentially accelerate update step (step 3). We first index the
data points that are to be clustered using a spatial partitioning
tree.
containing a subset of the data points
be given. We can assign
to the cluster
if
for
The cluster mean of
,
is said to dominate
with respect to all other cluster means. Note that any hyper-region could replace the hyper-rectangle bonding primitive, as long as there is a way to compute the minimum distance and the maximum distance between the bound and a point.
. When binary trees such as
-trees are used, we test on the left and the right subsets of
using their bounding primitives. When a leaf node is reached, we simply assign each point in it to the closest cluster center.
time. For more details, please refer [pelleg1999accelerating].
such that it completely ‘dominates’ all other means.
as mentioned is also the mean closest to the hyper-rectangle. Yet, if
does not dominate all other means we recurse to the children of the hyper-rectangle where again the same
computation is done. The information that
dominated some (if any but not all) of the means is lost. It can be shown that if
dominates another mean
, than
dominates all children of the hyper-rectangle owned by
as well. By passing this information down the tree we can save more computation.Finally, if we were to store the center of mass, as cache sufficient statistics, of each hyper-rectangle, then once a dominant mean was found an iteration through all the points in this hyper-rectangle would not be necessary. A simple weighted add could be performed to facilitate the update step.
The current implementation implements an accelerated version of the Lloyd’s algorithm and supports the following features:
- The similarity measure under the weighted Euclidean metric or the Euclidean metric.
- Computing the
-means using the naive algorithm or the accelerated tree-version [pelleg1999accelerating].
- The ability to index the data points for the computation using a
-tree or a ball-tree.
- The progressive computation of
-means. The
-means algorithm is an inherently iterative method, which can be made to terminate after a fixed number of iterations.
The following features are currently unsupported and will be implemented soon:
- Multiple restarts to avoid local minima in the produced clusters.
- Cross-validation.
- More compreshensive diagonistic outputs for the produced clusters, such as the covariance of each cluster.
![]()
Figure 1- Visualization of the hyper-rectangles owned by centers. All points that belong to a specific center are colored the same color.
| [hamerly2003learning] | “Learning the k in k-means”, Hamerly, G. and Elkan, C. In Advances in Neural Information Processing Systems 2003. |
| [pelleg1999accelerating] | (1, 2) “Accelerating exact k-means algorithms with geometric reasoning”, Pelleg, D. and Moore, A. Proceedings of the fifth ACM SIGKDD 1999. |
| [pelleg2000extending] | “X-means: Extending K-means with efficient estimation of the number of clusters”, In Proceedings of the 17th International Conf. on Machine Learning, pages 727–734. Morgan Kaufmann, San Francisco, CA, 2000. |
| [zheng-tobacco] | “Tobacco Distribution Based on Improved K-means Algorithm”, Zheng, B. and Tang, F. and Yang, R.I, in IEEE/INFORMS 2009 |
| [reynolds1999relationship] | “A relationship customer typology”, Reynolds, K.E. and Beatty, S.E, in Journal of Retailing, vol 75, 2009 |
Available options:
--help Print this information.
--references_in REQUIRED file containing data to be clustered.
--memberships_out OPTIONAL file to store cluster memberships.
--centroids_out OPTIONAL file to store cluster means.
--k_clusters(=2) Number of clusters.
--point(=dense) Point type used by k-means. One of: dense, sparse, dense_sparse, categorical, dense_categorical
--metric(=l2) Metric function used by k-means. One of: l2, weighted_l2
--metric_weights_in A file containing weights for use with –metric=weighted_l2
--dense_sparse_scale (=1) The scaling factor for the sparse part of --point=dense_sparse or --point=dense_categoric. You can only use it with --metric=weighted_l2`
--algorithm (=tree) Algorithm used to compute clusters. One of tree or naive
--tree (=kdtree) Tree structure used by k-means. One of kdtree or balltree
--leaf_size (=20) Maximum number of points at a leaf of the tree. More saves on tree overhead but too much hurts asymptotic run-time.
--iterations (=-1) K-means can run in either batch or progressive mode. If --iterations=i is omitted, K-means finds exact clusters; otherwise, it terminates after i progressive refinements.
--log A file to receive the log, or omit for stdout.
--loglevel (=debug) Level of log detail. One of:
- debug: log everything
- verbose: log messages and warnings
- warning: log only warnings
- silent: no logging
You must first connect to a cloud machine, for instructions see Running 1305 on the Amazon EC2 cloud. The kmeans algorithm can be run from your home directory. Run the following command on the console
kmeans --help
This will display all the options available to you and test that you are being able to run the executable. Next let us say you have a csv file which contains your data. You want to run kmeans clustering on this data. Here is the command you want to run
kmeans --references_in=samples/sample100k.txt \ --centroids_out=centroids.txt \ --k_clusters=4 \ --memberships_out=assignments.txt
The arguments are as follows:
This program makes use of multidimensional trees in order to compute kmeans. The default tree is kdtree. We also support the metric-tree also known as ball-tree. Ball tree might be faster in some cases, mainly in higher dimenions. In general there is not a rule of thumb for when to use kdtree or balltree. When you have sparse data though you should always use balltree as the kdtree may be very inefficient.
// Use of kdtree kmeans --references_in=samples/sample100k.txt \ --centroids_out=centroids.txt \ --k_clusters=4 \ --memberships_out=assignments.txt \ --tree=kdtree // Use of balltree kmeans --references_in=samples/sample100k.txt \ --centroids_out=centroids.txt \ --k_clusters=4 \ --memberships_out=assignments.txt \ --tree=balltree
If your data was sparse this is how you would run it:
// Use of balltree with sparse data kmeans --references_in=samples/sample100k.txt \ --centroids_out=centroids.txt \ --k_clusters=4 \ --memberships_out=assignments.txt \ --tree=balltree --point=sparse
You can also run the kmeans in the naive mode (without trees) and see the difference in performance:
// Fast implementation time kmeans --references_in=samples/sample100k.txt \ --centroids_out=centroids.txt \ --k_clusters=4 \ --memberships_out=assignments.txt // Slow implementation time kmeans --references_in=samples/sample100k.txt \ --centroids_out=centroids.txt \ --k_clusters=4 \ --memberships_out=assignments.txt \ --algorithm=naive
We would like to know your opinion about our product. Please send us bug reports and suggestions at support@analytics1305.com