K-means clustering explained with Python

Naomy Duarte Gomes
7 min read · Jun 10, 2022


K-means clustering is an unsupervised machine learning algorithm that analyzes the formation of clusters in data, that is, it finds groups of similar datapoints. How does it work? What is the logic behind this algorithm? What are the advantages of clustering? Let’s explore the particularities of this clustering technique.

Clustering

Sometimes, when analyzing data, natural similarities between the datapoints arise and we observe the formation of patterns. The groups into which these patterns organize themselves are the clusters.

Fig. 1 — Datapoints on the left naturally form clusters (green, blue, and yellow).

When clustering happens in a dataset, it allows us to explore relationships among the datapoints. This can be useful as a way of drawing direct conclusions about the dataset or can be used to create new features that will make it simpler for a model to learn the data.

Take as an example a dataset of mall customers where the features are: Genre, Age, Annual Income, and Spending Score. The Annual Income is measured in thousands of dollars and the Spending Score is a value between 1 and 100 assigned to each customer based on how much they spend in the mall. The higher the value, the more they spend.

Fig. 2 — First five rows of the mall customers dataset.

A mall would be interested in analyzing how much a customer spends and how much they make in a year, to better recommend products, discounts, etc. The datapoints we would like to analyze here are, therefore, from the features Annual Income and Spending Score. Let’s take a look at a scatter plot of these two columns:

Fig. 3 — Scatter plot of the features Annual Income and Spending Score.

We can notice some clustering formations arising naturally in this dataset:

Fig. 4 — Inferring cluster formation by eye with the features Annual Income and Spending Score.

We could also be interested in whether the datapoints of the Age and Spending Score features present clusters. Let’s make a scatter plot and see how the datapoints are distributed:

Fig. 5 — Scatter plot of the features Age and Spending Score.

There is no naturally arising cluster formation, at least not so obvious to the eye as in the previous case. A closer inspection could, however, reveal some potential clustering.

Fig. 6 — Inferring cluster formation by eye with features Age and Spending Score.

We will analyze these two cases and write the k-means algorithm for both of them.

Algorithm and application with Python

K-means clustering is an unsupervised machine learning algorithm that works based on the squared error criterion, which is derived from the Euclidean distance. In two dimensions, the distance between points $p = (p_1, p_2)$ and $q = (q_1, q_2)$ is given by

$d(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2}$

Fig. 7 — Euclidean distance between two points in two dimensions.

What the algorithm does is use this squared error to measure how well the datapoints are assigned to their clusters. It seeks to minimize the squared error, meaning that the distance from each datapoint to its assigned cluster center is as small as possible.

The algorithm behind the k-means clustering is as follows:

  1. Choose the number k of clusters.
  2. Randomly select k points at which the k centroids will be initialized.
  3. Assign each point in the dataset to its nearest centroid. The cluster formation begins.
  4. Once the clusters are formed, calculate the center of mass of each one and move its centroid there.
  5. Reassign each datapoint to its nearest centroid and repeat steps 4 and 5. When no datapoint changes cluster, convergence is reached.
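As a sketch of how these steps translate into code, here is a minimal NumPy version (written to illustrate the steps above, not taken from the article's original code; empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means sketch: X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize the k centroids at randomly chosen datapoints.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each point to its nearest centroid (squared Euclidean distance).
        sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # Step 4: move each centroid to the center of mass of its cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids (and hence the assignments) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```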


You may now be wondering: how do we know how many clusters to choose? To answer that question, there is the elbow method. It consists of plotting the WCSS (Within-Cluster Sum of Squares) as a function of the number k of clusters. The WCSS is the sum of the squared distances between each point and the centroid of the cluster it is assigned to.
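In symbols, writing $C_j$ for cluster $j$ and $\mu_j$ for its centroid:

$\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$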

When the curve shows an elbow, that’s the number of clusters beyond which the squared error stops decreasing significantly, and it’s this number we choose as k.

Now that we know how to choose k, let’s implement the k-means on the mall customers dataset.

Importing libraries, reading the table, and looking at its shape:
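A minimal version of this step (the file name Mall_Customers.csv is an assumption; point it at your copy of the dataset):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# File name is an assumption; adjust the path to your copy of the dataset.
df = pd.read_csv('Mall_Customers.csv')
print(df.shape)  # the standard Kaggle version has 200 rows and 5 columns
df.head()
```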

As we saw in Fig. 3, the scatter plot of columns Annual Income and Spending Score suggests clustering. We will then select these columns and apply the k-means algorithm to generate the elbow plot:
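A sketch of this step, assuming the standard Kaggle column names 'Annual Income (k$)' and 'Spending Score (1-100)':

```python
# Column names follow the standard Kaggle dataset; adjust if yours differ.
X = df[['Annual Income (k$)', 'Spending Score (1-100)']].values

# Compute the WCSS (scikit-learn exposes it as inertia_) for k = 1..10.
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    km.fit(X)
    wcss.append(km.inertia_)

plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('WCSS')
plt.title('Elbow method')
plt.show()
```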

which indicates the number of clusters to be 5. We can finally apply k-means to the data, which will assign each row of the selected columns to one of the five clusters (labeled 0, 1, 2, 3, 4), here represented by the new column Cluster.
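One way to fit the final model and store the labels (a sketch; the column name Cluster follows the text):

```python
kmeans = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=42)
df['Cluster'] = kmeans.fit_predict(X)  # labels 0-4, one per customer
df.head()
```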

The next step is to graphically see the clusters:
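For instance, with matplotlib:

```python
# Color each point by its cluster label and mark the centroids.
plt.scatter(X[:, 0], X[:, 1], c=df['Cluster'], cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', marker='x', s=100, label='Centroids')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
```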

If we compare the cluster formation with k-means to the cluster formation we could see by eye in Fig. 4, there’s an agreement. Now, let’s see what kind of cluster formation, if any, there is for the features Age and Spending Score.

The elbow method suggests k=4, one more cluster than we could infer by eye in Fig. 6. Applying k-means and adding the column Cluster2, we have
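A sketch of the same steps for this feature pair:

```python
X2 = df[['Age', 'Spending Score (1-100)']].values

# Elbow plot for the second feature pair; the elbow appears at k = 4.
wcss2 = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    km.fit(X2)
    wcss2.append(km.inertia_)
plt.plot(range(1, 11), wcss2, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('WCSS')
plt.show()

kmeans2 = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=42)
df['Cluster2'] = kmeans2.fit_predict(X2)
df.head()
```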

We can now see the scatter plot with the cluster formation and compare it to Fig. 6:
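Plotted the same way as before:

```python
plt.scatter(X2[:, 0], X2[:, 1], c=df['Cluster2'], cmap='viridis')
plt.scatter(kmeans2.cluster_centers_[:, 0], kmeans2.cluster_centers_[:, 1],
            c='red', marker='x', s=100, label='Centroids')
plt.xlabel('Age')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
```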

The k-means algorithm identified 4 clusters and, contrary to what I drew in Fig. 6, it separated the upper-left points into two clusters (pink and green), grouped the points at the bottom into a single cluster (yellow), and placed the points on the middle right in another cluster (blue).

Silhouette analysis

Silhouette analysis can be used to tell how well the datapoints fit in a cluster. In this method, cluster sizes can be analyzed, and we can also obtain average silhouette coefficients that measure how far or close the datapoints of one cluster are to other clusters. This coefficient lies in the range [-1, 1]: negative values indicate overlapping clusters, values near 0 mean that clusters are very close to each other, and values close to 1 indicate clusters that are far apart. More on silhouette analysis for k-means clustering can be found here. In this text, we will restrict ourselves to calculating the mean silhouette coefficient for each of the analyzed cases.

For the features Annual Income and Spending Score, we have a silhouette score of ~0.55, which is a good value and confirms what we saw, that is, clusters are, on average, far apart.

For the features Age and Spending Score, we have a silhouette score of ~0.50, which is lower than the first score, but it is also good.
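Both scores can be computed with scikit-learn (a sketch, reusing the arrays and label columns from the code above):

```python
from sklearn.metrics import silhouette_score

print(silhouette_score(X, df['Cluster']))    # ~0.55 for Annual Income vs. Spending Score
print(silhouette_score(X2, df['Cluster2']))  # ~0.50 for Age vs. Spending Score
```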

Observations

Some important observations:

  1. The k-means algorithm relies on features being on comparable scales to work properly; otherwise, it can produce wrong clusterings. It is therefore important to standardize the features (in the present case, the features were already on similar scales and there was no need to standardize; see the sketch after this list).
  2. The groups should have more or less the same number of observations; k-means struggles with very unbalanced cluster sizes.
  3. K-means is sensitive to the initial positions of the centroids. That’s why we chose ‘k-means++’ initialization, but there are many discussions and several other techniques to deal with this problem (see here), which will not be discussed.
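For a dataset where standardization is needed (not the case here), a minimal sketch with scikit-learn's StandardScaler might look like this:

```python
from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean and unit variance before clustering.
X_scaled = StandardScaler().fit_transform(X)
labels_scaled = KMeans(n_clusters=5, init='k-means++', n_init=10,
                       random_state=42).fit_predict(X_scaled)
```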

With the conditions above satisfied, and if the dataset has cluster formation, we can obtain valuable information about our data with k-means clustering. Another use is to improve model performance. As an example, in the Spaceship Titanic Kaggle code that can be found here, clustering the columns Room Service, Food Court, Shopping Mall, Spa, VR, and Deck improved model performance, making it easier for the model to find the relationships between those columns.

Conclusions

The k-means algorithm is widely used to analyze data and the interrelationships between features. It can be used in different situations, such as separating clients based on their spending habits, musical taste, or TV preferences. We can also use it to improve model performance. The elbow method allows us to choose the number of clusters when it's not visually obvious, and the algorithm is very straightforward to apply to a dataset, making k-means a very advantageous method.
