Coding K-Means Clustering using Python and NumPy
For the day-to-day work of a Machine Learning Engineer or Data Scientist, it is common to use popular ML frameworks like Scikit-learn, Pytorch, etc. These frameworks provide us with highly optimized implementations of most ML algorithms to use out of the box.
Despite this, it's a good exercise to try and code some of the basic algorithms from scratch, or using just NumPy. Writing code helps solidify our conceptual understanding of the algorithms, and improves our coding ability. Implementing ML algorithms without using frameworks is also a popular interview exercise. Thus, it's best to be able to code algorithms such as K-Means, K Nearest Neighbours, Linear Regression and Logistic Regression.
In this post, we'll implement the K-means clustering algorithm. The code is adapted from multiple sources listed in the references at the bottom, but presented in a way to represent the block-by-block process of coding something relatively complex.
K-means clustering is an unsupervised learning algorithm, which groups an unlabeled dataset into different clusters. The "K" refers to the number of pre-defined clusters the dataset is grouped into.
We'll implement the algorithm using Python and NumPy to understand the concepts more clearly.
- K = number of clusters
- X = training data of shape (m, n): m samples and n features
- max_iterations = max number of iterations to run the algorithm for
Plainly, the algorithm entails the following steps:
- Randomly initialize K cluster centroids i.e. the center of the clusters.
- Repeat till convergence or end of max number of iterations:
- For samples i=1 to m in the dataset:
- Assign the closest cluster centroid to X[i]
- For cluster k=1 to K:
- Find new cluster centroids by calculating the mean of the points assigned to cluster k.
- For samples i=1 to m in the dataset:
We will define the needed functions as and when we require them.
1import numpy as np
1. Randomly initialize K cluster centroids
As a starting point, we'll initialize the K cluster centoids by picking K samples at random from the dataset X.
Note that this method of initialization can result in different clusters being found in different runs of the algorithm. The clusters will also depend on the location of the initial centroids.
A smarter initialization mehtod, which produces more stable clusters, while maximizing the distance between a centroid to other centroids is the k-means++ algorithm. We won't be covering it here, but feel free to read up on it. K-means++ is the initialization algorithm used in Scikit-learn's implementation.
1# randomly initializing K centroid by picking K samples from X2def initialize_random_centroids(K, X):3 """Initializes and returns k random centroids"""4 m, n = np.shape(X)5 # a centroid should be of shape (1, n), so the centroids array will be of shape (K, n)6 centroids = np.empty((K, n))7 for i in range(K):8 # pick a random data point from X as the centroid9 centroids[i] = X[np.random.choice(range(m))]10 return centroids
2. Calculate euclidean distance between two vectors
In order to find the closest centroid for a given sample x, we can use Euclidean Distance between a given centroid and x.
The euclidean distance between two points, p and q in Euclidean n-space is given by the formula:
This can be adapted by thinking in terms of two vectors x1 and x2:
1def euclidean_distance(x1, x2):2 """Calculates and returns the euclidean distance between two vectors x1 and x2"""3 return np.sqrt(np.sum(np.power(x1 - x2, 2)))
We can also use the calculate the same by taking the L2 norm of the difference between the two vectors. This can be accomplished using NumPy:
1np.linalg.norm(x1 - x2)
3. Finding the closest centroid to a given data point
We can find the closest centriod for a given data point by iterating through the centroids and picking the one with the minimum distance.
1def closest_centroid(x, centroids, K):2 """Finds and returns the index of the closest centroid for a given vector x"""3 distances = np.empty(K)4 for i in range(K):5 distances[i] = euclidean_distance(centroids[i], x)6 return np.argmin(distances) # return the index of the lowest distance
4. Create clusters
Assign the samples to closest centroids to create the clusters:
1def create_clusters(centroids, K, X):2 """Returns an array of cluster indices for all the data samples"""3 m, _ = np.shape(X)4 cluster_idx = np.empty(m)5 for i in range(m):6 cluster_idx[i] = closest_centroid(X[i], centroids, K)7 return cluster_idx
5. Compute means
Compute the means of cluster to find new centroids.
NumPy axes can be tricky if you're just starting out. This article is an excellent refresher.
1def compute_means(cluster_idx, K, X):2 """Computes and returns the new centroids of the clusters"""3 _, n = np.shape(X)4 centroids = np.empty((K, n))5 for i in range(K):6 points = X[cluster_idx == i] # gather points for the cluster i7 centroids[i] = np.mean(points, axis=0) # use axis=0 to compute means across points8 return centroids
6. Putting everything together
Let's build a function that can run the K-means algorithm for the required number of iterations, or till convergence.
1def run_Kmeans(K, X, max_iterations=500):2 """Runs the K-means algorithm and computes the final clusters"""3 # initialize random centroids4 centroids = initialize_random_centroids(K, X)5 # loop till max_iterations or convergance6 for _ in range(max_iterations):7 # create clusters by assigning the samples to the closet centroids8 clusters = create_clusters(centroids, K, X)9 previous_centroids = centroids10 # compute means of the clusters and assign to centroids11 centroids = compute_means(clusters, K, X)12 # if the new_centroids are the same as the old centroids, return clusters13 diff = previous_centroids - centroids14 if not diff.any():15 return clusters16 return clusters
7. Testing it out
To test our implementation, we can use Scikit-learn's
1from sklearn import datasets2# creating a dataset for clustering3X, y = datasets.make_blobs()4y_preds = run_Kmeans(3, X)
8. Plotting the results
To plot the clusters in 2D, we can use the plotting function from ML-From-Scratch Github repository. We'll plot the clusters calculated by our implementation, and the ones returned by Scikit-learn.
1from mlfromscratch.utils import Plot2p = Plot()3p.plot_in_2d(X, y_preds, title="K-Means Clustering")4p.plot_in_2d(X, y, title="Actual Clustering")
Again, the clusters can depend on the initialization points of centroids, but this time it looks like our implementation was able to find the correct clusters.
In this post, we saw how we can implement K-means clustering algorithm from scratch using Python and NumPy. Be sure to brush up other concepts and implementation before giving your next ML interview!
- ML From Scratch - An excellent Github repository containing implementations of many machine learning models and algorithms. Easy to understand and highly recommended.
- Code ML Algorithms from Scracth - A good blog post similar to this one, echoing the sentiment that this type of exercise is common in ML interviews.