How it works...

K-means was developed by James MacQueen, who, in 1967, designed it for the purpose of dividing groups of objects into k partitions based on their attributes. It is a variation of the expectation-maximization (EM) algorithm, whose objective is to determine the groups of k data generated by Gaussian distributions. The difference between the two algorithms lies in the Euclidean distance calculation method. In k-means, it is assumed that the attributes of the object can be represented as vectors, and thus form a vector space. The goal is to minimize the total intra-cluster variance (or standard deviation). Each cluster is identified by a centroid.

The algorithm follows an iterative procedure, as follows:

  1. Choose the number of k clusters
  2. Initially, create k partitions and assign each entry partition either randomly, or by using some heuristic information
  3. Calculate the centroid of each group
  4. Calculate the distance between each observation and each cluster centroid
  5. Then, construct a new partition by associating each entry point with the cluster whose centroid is closer to it
  6. The centroid for new clusters is recalculated
  7. Repeat steps 4 to 6 until the algorithm converges