Sajal Sharma

Creating Customer Segments using Unsupervised Machine Learning

05.02.2017 -
Python, Scikit-learn, PCA, Clustering


In this project, we will analyze a dataset containing data on various customers' annual spending amounts (reported in monetary units) of diverse product categories for internal structure. One goal of this project is to best describe the variation in the different types of customers that a wholesale distributor interacts with. Doing so would equip the distributor with insight into how to best structure their delivery service to meet the needs of each customer.

The dataset for this project can be found on the UCI Machine Learning Repository. For the purposes of this project, the features 'Channel' and 'Region' will be excluded in the analysis — with focus instead on the six product categories recorded for customers.

# Import libraries necessary for this project
import numpy as np
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames

# Import supplementary visualizations code
import visuals as vs

# Pretty display for notebooks
%matplotlib inline

# Load the wholesale customers dataset
try:
    data = pd.read_csv("customers.csv")
    data.drop(['Region', 'Channel'], axis = 1, inplace = True)
    print("Wholesale customers dataset has {} samples with {} features each.".format(*data.shape))
except Exception:
    print("Dataset could not be loaded. Is the dataset missing?")
Wholesale customers dataset has 440 samples with 6 features each.

Data Exploration

In this section, we will begin exploring the data through visualizations and code to understand how each feature is related to the others.

The dataset is composed of six important product categories: 'Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', and 'Delicatessen'. The code block below produces a statistical summary for each of the above product categories.

# Display a description of the dataset
display(data.describe())

Selecting Samples

To get a better understanding of the customers and how their data will transform through the analysis, let's select a few sample data points and explore them in more detail.

# Select three indices to sample from the dataset
indices = [26,176,392]

# Create a DataFrame of the chosen samples
samples = pd.DataFrame(data.loc[indices], columns = data.keys()).reset_index(drop = True)
print("Chosen samples of wholesale customers dataset:")
display(samples)
Chosen samples of wholesale customers dataset:

Guessing Establishments

Considering the total purchase cost of each product category and the statistical description of the dataset above, what kind of establishment (customer) could each of the three chosen samples represent?

Looking at the total purchase of each product category above and comparing them with the medians of the distributions, we can guess that:

  • The first customer in the sample (index 0) might be a restaurant: we see a high amount of Frozen and close-to-median amounts of Fresh and Deli.
  • The second customer in the sample (index 1) might be a supermarket: purchases are very high or close to the median in every category except Deli, so perhaps this supermarket doesn't have a deli section.
  • The third customer in the sample (index 2) might be a cafe: we see a high purchase of Milk, close-to-median levels for Grocery and Deli, and relatively low purchases of Fresh and Frozen.

Feature Relevance

One interesting thought to consider is if one (or more) of the six product categories is actually relevant for understanding customer purchasing. That is to say, is it possible to determine whether customers purchasing some amount of one category of products will necessarily purchase some proportional amount of another category of products? We can make this determination quite easily by training a supervised regression learner on a subset of the data with one feature removed, and then score how well that model can predict the removed feature.

Let's do this for the 'Milk' feature.

# train_test_split lives in sklearn.model_selection (sklearn.cross_validation in older releases)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Make a copy of the DataFrame, using the 'drop' function to drop the given feature
new_data = data.drop(['Milk'], axis=1)

# Split the data into training and testing sets using the given feature as the target
X_train, X_test, y_train, y_test = train_test_split(new_data, data['Milk'], test_size=0.25, random_state=101)

# Create a decision tree regressor and fit it to the training set
regressor = DecisionTreeRegressor(random_state=101).fit(X_train, y_train)

# Report the score of the prediction using the testing set
score = regressor.score(X_test, y_test)

print(score)

Feature Relevance Prediction

We tried to predict the 'Milk' feature (i.e. annual spending on milk products), based on the other features in the dataset (annual spending on other product categories).

The R² score on the test set was 0.2957. A perfect fit scores 1, and a model that does no better than always predicting the mean scores 0 (the score can even be negative for a worse model), so the model we built for customers' milk purchasing habits isn't very good, although it is possible that there's some correlation between this feature and others.

This suggests that the 'Milk' feature is necessary for identifying customers' spending habits, because spending on Milk cannot be predicted well from spending on the other product categories. In other words, the 'Milk' feature adds extra (and maybe key) information that a model cannot easily infer from the other features alone.
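To make the score itself less of a black box, here is a minimal sketch of how the coefficient of determination is computed. The helper name `r2_score_manual` and the toy numbers are ours, purely for illustration; they are not part of the project code.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]  # toy target values (illustrative only)

perfect = r2_score_manual(y, y)                        # predictions match exactly
mean_only = r2_score_manual(y, [2.5, 2.5, 2.5, 2.5])   # always predict the mean
worse = r2_score_manual(y, [4.0, 3.0, 2.0, 1.0])       # systematically wrong predictions
```

Here `perfect` is 1, `mean_only` is 0 and `worse` is negative, which is why a score of about 0.3 reads as a weak but not useless fit.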

Visualize Feature Distributions

To get a better understanding of the dataset, we can construct a scatter matrix of each of the six product features present in the data. If it is found that the feature we attempted to predict above is relevant for identifying a specific customer, then the scatter matrix below may not show any correlation between that feature and the others. Conversely, if we believe that feature is not relevant for identifying a specific customer, the scatter matrix might show a correlation between that feature and another feature in the data.

# Produce a scatter matrix for each pair of features in the data
# (pd.plotting.scatter_matrix replaces the older pd.scatter_matrix)
pd.plotting.scatter_matrix(data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');



Looking at the plot above, there are a few pairs of features that exhibit some degree of correlation. They include:

  • Milk and Grocery
  • Milk and Detergents_Paper
  • Grocery and Detergents_Paper

Since we tried to predict the 'Milk' feature earlier, this confirms the suspicion that Milk isn't correlated with most of the features in the dataset, although it shows a mild correlation with 'Grocery' and 'Detergents_Paper'.

The distributions of all the features appear to be similar: each is strongly right-skewed, in that most of the data points fall in the first few intervals. Judging by the summary statistics calculated earlier, especially the mean and maximum values, we can expect some outliers in each distribution. This is consistent with the significant difference between the mean and the median of each feature.

Data Preprocessing

In this section, we will preprocess the data to create a better representation of customers by scaling the data and detecting (and optionally removing) outliers. Preprocessing data is often a critical step in ensuring that the results you obtain from your analysis are significant and meaningful.

Feature Scaling

If data is not normally distributed, especially if the mean and median vary significantly (indicating a large skew), it is most often appropriate to apply a non-linear scaling, particularly for financial data. One way to achieve this scaling is the Box-Cox transformation, which calculates the best power transformation of the data that reduces skewness. A simpler approach which can work in most cases is applying the natural logarithm.

# Scale the data using the natural logarithm
log_data = np.log(data)

# Scale the sample data using the natural logarithm
log_samples = np.log(samples)

# Produce a scatter matrix for each pair of newly-transformed features
# (pd.plotting.scatter_matrix replaces the older pd.scatter_matrix)
pd.plotting.scatter_matrix(log_data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');



After applying a natural logarithm scaling to the data, the distribution of each feature should appear much more normal.
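To see why the logarithm helps, here is a small sketch on synthetic data (not the wholesale dataset). We draw right-skewed "spending" values from a log-normal distribution, which is our assumption for illustration, and compare the sample skewness before and after the transform. The `skewness` helper is a hypothetical name for the standard third-moment formula.

```python
import numpy as np

def skewness(x):
    """Sample skewness: third central moment divided by variance^(3/2)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float(np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5)

rng = np.random.RandomState(0)

# Synthetic, strongly right-skewed values (assumption: log-normal spending)
spending = np.exp(rng.normal(loc=8.0, scale=1.0, size=2000))

raw_skew = skewness(spending)          # large and positive
log_skew = skewness(np.log(spending))  # close to zero

# For right-skewed data the mean sits above the median
mean_above_median = spending.mean() > np.median(spending)
```

On data like this the raw skewness is large and positive, while the log-transformed values are roughly symmetric.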

Let's check out our log transformed samples.

# Display the log-transformed sample data
display(log_samples)

Outlier Detection

Detecting outliers in the data is extremely important in the data preprocessing step of any analysis. The presence of outliers can often skew results which take these data points into consideration. There are many "rules of thumb" for what constitutes an outlier in a dataset. Here, we will use Tukey's method for identifying outliers: an outlier step is calculated as 1.5 times the interquartile range (IQR). A data point with a feature that is beyond an outlier step outside of the IQR for that feature is considered abnormal.

# OPTIONAL: Select the indices for data points you wish to remove
outliers = []

# For each feature find the data points with extreme high or low values
for feature in log_data.keys():

    # Calculate Q1 (25th percentile of the data) for the given feature
    Q1 = np.percentile(log_data[feature], 25)

    # Calculate Q3 (75th percentile of the data) for the given feature
    Q3 = np.percentile(log_data[feature], 75)

    # Use the interquartile range to calculate an outlier step (1.5 times the interquartile range)
    step = (Q3 - Q1) * 1.5

    # Display the outliers
    print("Data points considered outliers for the feature '{}':".format(feature))
    out = log_data[~((log_data[feature] >= Q1 - step) & (log_data[feature] <= Q3 + step))]
    display(out)
    outliers = outliers + list(out.index.values)

# Keep only the indices that were flagged as outliers for more than one feature
outliers = list(set([x for x in outliers if outliers.count(x) > 1]))

print("Outliers: {}".format(outliers))

# Remove the outliers, if any were specified
good_data = log_data.drop(log_data.index[outliers]).reset_index(drop = True)
print("The good dataset now has {} observations after removing outliers.".format(len(good_data)))
Data points considered outliers for the feature 'Fresh':

Data points considered outliers for the feature 'Milk':


Data points considered outliers for the feature 'Grocery':


Data points considered outliers for the feature 'Frozen':


Data points considered outliers for the feature 'Detergents_Paper':


Data points considered outliers for the feature 'Delicatessen':

Outliers: [128, 65, 66, 75, 154]
The good dataset now has 435 observations after removing outliers.

Upon quick inspection, our sample doesn't contain any of the outlier values.

There were 5 data points that were considered outliers for more than one feature based on our definition above. So, instead of removing all outliers (which would result in us losing a lot of information), only outliers that occur for more than one feature are removed.

We can also analyse these outliers independently to answer questions about how or when they occur (root cause analysis), but they might not be suitable for an aggregate analysis.
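The rule used above (flag points beyond 1.5 times the IQR, then keep only those flagged for more than one feature) can be checked by hand on a tiny example. The sketch below uses two made-up columns and a hypothetical helper, `tukey_outlier_positions`; it counts repeats with `collections.Counter` rather than `list.count`, which is just a design choice on our part.

```python
from collections import Counter
import numpy as np

def tukey_outlier_positions(values, k=1.5):
    """Return positions of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    step = k * (q3 - q1)
    mask = (values < q1 - step) | (values > q3 + step)
    return np.flatnonzero(mask).tolist()

# Two made-up features for eight customers (illustrative values only)
features = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 50.0],
    "b": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, -40.0, 90.0],
}

# Count how many features flag each row position
counts = Counter()
for name, column in features.items():
    counts.update(tukey_outlier_positions(column))

# Rows flagged for more than one feature
repeated = sorted(i for i, c in counts.items() if c > 1)
```

Row 7 is extreme in both columns and ends up in `repeated`, while row 6 is extreme only in column "b" and is kept.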

Feature Transformation

In this section we will use principal component analysis (PCA) to draw conclusions about the underlying structure of the wholesale customer data. Since using PCA on a dataset calculates the dimensions which best maximize variance, we will find which compound combinations of features best describe customers.


Now that the data has been scaled to a more normal distribution and the necessary outliers have been removed, we can apply PCA to good_data to discover which dimensions of the data best maximize the variance of the features involved. In addition to finding these dimensions, PCA also reports the explained variance ratio of each dimension, that is, how much of the variance within the data is explained by that dimension alone. Note that a component (dimension) from PCA can be considered a new "feature" of the space; however, it is a composition of the original features present in the data.

from sklearn.decomposition import PCA

# Apply PCA by fitting the good data with the same number of dimensions as features
pca = PCA().fit(good_data)

# Transform log_samples using the PCA fit above
pca_samples = pca.transform(log_samples)

# Generate PCA results plot
pca_results = vs.pca_results(good_data, pca)


The first and second principal components together explain approx. 70.8% of the variance in our data.

The first four principal components together explain approx. 93.11% of the variance.

In terms of customer spending,

  • Dimension 1 has a high positive weight for the Milk, Grocery, and Detergents_Paper features. This might represent hotels, where these items are usually needed for the guests.
  • Dimension 2 has a high positive weight for Fresh, Frozen, and Delicatessen. This dimension might represent restaurants, where these items are used as ingredients in cooking dishes.
  • Dimension 3 has a high positive weight for the Deli and Frozen features and a low positive weight for Milk, but has negative weights for everything else. This dimension might represent delis.
  • Dimension 4 has positive weights for Frozen, Detergents_Paper and Grocery, while being negative for Fresh and Deli. It's a bit tricky to pin this segment down, but I do believe that there are shops that sell frozen goods exclusively.
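The explained variance percentages quoted above are shares of the total variance carried by each component. As a rough sketch of what that means, the snippet below recomputes such ratios from the eigenvalues of a covariance matrix. The data is synthetic (six features driven by two hidden factors, our assumption), so the numbers will not match the project's.

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic stand-in for good_data: 6 features driven by 2 hidden factors plus noise
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
X = np.dot(latent, mixing) + 0.1 * rng.normal(size=(300, 6))

# Eigenvalues of the covariance matrix are the variances along each principal axis
cov = np.cov(X - X.mean(axis=0), rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]  # sort in descending order

ratio = eigvals / eigvals.sum()  # share of variance per component
cumulative = np.cumsum(ratio)    # running total, as quoted in the text
```

With a fitted scikit-learn `PCA` object, the same running total should be available as `np.cumsum(pca.explained_variance_ratio_)`.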

Let's see how the log-transformed sample data has changed after having a PCA transformation applied to it in six dimensions.

# Display sample log-data after having a PCA transformation applied
display(pd.DataFrame(np.round(pca_samples, 4), columns = pca_results.index.values))
Dimension 1   Dimension 2   Dimension 3   Dimension 4   Dimension 5   Dimension 6

Dimensionality Reduction

When using principal component analysis, one of the main goals is to reduce the dimensionality of the data, in effect reducing the complexity of the problem. Dimensionality reduction comes at a cost: using fewer dimensions means less of the total variance in the data is explained. Because of this, the cumulative explained variance ratio is extremely important for knowing how many dimensions are necessary for the problem. Additionally, if a significant amount of variance is explained by only two or three dimensions, the reduced data can be visualized afterwards.

# Apply PCA by fitting the good data with only two dimensions
pca = PCA(n_components=2).fit(good_data)

# Transform the good data using the PCA fit above
reduced_data = pca.transform(good_data)

# Transform log_samples using the PCA fit above
pca_samples = pca.transform(log_samples)

# Create a DataFrame for the reduced data
reduced_data = pd.DataFrame(reduced_data, columns = ['Dimension 1', 'Dimension 2'])

Let's see how the log-transformed sample data has changed after having a PCA transformation applied to it using only two dimensions.

# Display sample log-data after applying PCA transformation in two dimensions
display(pd.DataFrame(np.round(pca_samples, 4), columns = ['Dimension 1', 'Dimension 2']))
Dimension 1   Dimension 2
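We fixed the number of components at two so the data can be plotted. If the goal were instead to keep a given share of the variance, scikit-learn's `PCA` can pick the number of components itself when `n_components` is a fraction between 0 and 1. A minimal sketch on synthetic data follows; the 90% threshold and the generated data are our assumptions, not values from the project.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)

# Synthetic stand-in for good_data: 6 features driven by 2 hidden factors plus noise
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
X = np.dot(latent, mixing) + 0.1 * rng.normal(size=(300, 6))

# Ask for enough components to explain at least 90% of the variance
pca_90 = PCA(n_components=0.90, svd_solver='full').fit(X)

kept = pca_90.n_components_                         # number of components retained
explained = pca_90.explained_variance_ratio_.sum()  # variance share they cover
```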

Visualizing a Biplot

A biplot is a scatterplot where each data point is represented by its scores along the principal components. The axes are the principal components (in this case Dimension 1 and Dimension 2). In addition, the biplot shows the projection of the original features along the components. A biplot can help us interpret the reduced dimensions of the data, and discover relationships between the principal components and original features.

Run the code cell below to produce a biplot of the reduced-dimension data.

# Create a biplot
vs.biplot(good_data, reduced_data, pca)


Once we have the original feature projections (in red), it is easier to interpret the relative position of each data point in the scatterplot. For instance, a point in the lower right corner of the figure will likely correspond to a customer that spends a lot on 'Milk', 'Grocery' and 'Detergents_Paper', but not so much on the other product categories.


In this section, we will choose to use either a K-Means clustering algorithm or a Gaussian Mixture Model clustering algorithm to identify the various customer segments hidden in the data. We will then recover specific data points from the clusters to understand their significance by transforming them back into their original dimension and scale.

K-Means or Gaussian Mixture Model?

From what we know of both models:

Advantages of K-Means clustering:

  • Simple, easy to implement and interpret results.
  • Good for hard cluster assignments i.e. when a data point only belongs to one cluster over the others.

Advantages of Gaussian Mixture Model clustering:

  • Good for estimating soft clusters i.e. we're not sure if a point belongs to one cluster over another.
  • Does not force the clusters toward a specific structure (for example, similar sizes or roughly spherical shapes) that may or may not exist in the data.

Given what we know about the wholesale customer data so far, we'll choose to use Gaussian Mixture Model clustering over K-Means. This is because there might be some hidden patterns in the data that we may miss by assigning only one cluster to each data point. For example, let's take the case of the supermarket customer in our sample: while doing PCA, it had similar and high positive weights for multiple dimensions, i.e. it didn't belong to one dimension over the other. So a supermarket may be a combination of a fresh produce store, a grocery store and a frozen goods store.

We'll choose GMM, so that we don't miss cases like these.
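The difference between hard and soft assignments is easy to see on toy data. The sketch below uses two overlapping synthetic blobs (our assumption) and scikit-learn's `GaussianMixture`, the newer class, rather than the `GMM` class used elsewhere in this post. K-Means returns one label per point, while the mixture model returns a probability for each cluster.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)

# Two overlapping synthetic blobs in 2-D (illustrative data only)
blob_a = rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(150, 2))
blob_b = rng.normal(loc=[2.0, 0.0], scale=1.0, size=(150, 2))
X = np.vstack([blob_a, blob_b])

# Hard assignment: every point gets exactly one cluster label
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Soft assignment: every point gets a probability for each cluster
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft = gmm.predict_proba(X)

# Points whose most likely cluster has less than 90% probability
ambiguous = int(np.sum(soft.max(axis=1) < 0.9))
```

Points counted in `ambiguous` sit between the two blobs; these are the cases a hard assignment would hide.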

Creating Clusters

Depending on the problem, the number of clusters that we expect to be in the data may already be known. When the number of clusters is not known a priori, there is no guarantee that a given number of clusters best segments the data, since it is unclear what structure exists in the data — if any. However, we can quantify the "goodness" of a clustering by calculating each data point's silhouette coefficient. The silhouette coefficient for a data point measures how similar it is to its assigned cluster from -1 (dissimilar) to 1 (similar). Calculating the mean silhouette coefficient provides for a simple scoring method of a given clustering.

# The list ends with 2, so preds, centers and sample_preds from the final
# iteration describe the 2-cluster model used in the rest of the analysis
n_clusters = [8,6,4,3,2]

# Note: GMM was removed in scikit-learn 0.20; newer releases provide GaussianMixture instead
from sklearn.mixture import GMM
from sklearn.metrics import silhouette_score

for n in n_clusters:

    # Apply your clustering algorithm of choice to the reduced data
    clusterer = GMM(n_components=n).fit(reduced_data)

    # Predict the cluster for each data point
    preds = clusterer.predict(reduced_data)

    # Find the cluster centers
    centers = clusterer.means_

    # Predict the cluster for each transformed sample data point
    sample_preds = clusterer.predict(pca_samples)

    # Calculate the mean silhouette coefficient for the number of clusters chosen
    score = silhouette_score(reduced_data, preds)

    print("The silhouette_score for {} clusters is {}".format(n, score))
The silhouette_score for 8 clusters is 0.310453413564
The silhouette_score for 6 clusters is 0.271498911484
The silhouette_score for 4 clusters is 0.332870064265
The silhouette_score for 3 clusters is 0.376166165091
The silhouette_score for 2 clusters is 0.411818864386

Of the several cluster numbers tried, 2 clusters had the best silhouette score.
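For intuition about what the score measures: a point's silhouette is (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is its mean distance to the points of the nearest other cluster. Below is a small hand-rolled version, checked against scikit-learn on four made-up points. The function name `mean_silhouette` is ours, and the score of 0 for single-point clusters is the usual convention.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def mean_silhouette(X, labels):
    """Mean of (b - a) / max(a, b) over all points, using Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            scores.append(0.0)  # convention for single-point clusters
            continue
        a = dists[i, same].mean()  # cohesion: own cluster
        b = min(dists[i, labels == other].mean()  # separation: nearest other cluster
                for other in set(labels.tolist()) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Four made-up 1-D points forming two tight, well separated clusters
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

manual = mean_silhouette(X, labels)
reference = silhouette_score(X, labels)
```

Both values come out near 0.9 here because the two groups are tight and far apart; the 0.41 we obtained for two clusters indicates much more overlap.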

Cluster Visualization

# Display the results of the clustering from implementation
vs.cluster_results(reduced_data, preds, centers, pca_samples)


Data Recovery

Each cluster present in the visualization above has a central point. These centers (or means) are not specifically data points from the data, but rather the averages of all the data points predicted in the respective clusters. For the problem of creating customer segments, a cluster's center point corresponds to the average customer of that segment. Since the data is currently reduced in dimension and scaled by a logarithm, we can recover the representative customer spending from these data points by applying the inverse transformations.

# Inverse transform the centers
log_centers = pca.inverse_transform(centers)

# Exponentiate the centers
true_centers = np.exp(log_centers)

# Display the true centers
segments = ['Segment {}'.format(i) for i in range(0,len(centers))]
true_centers = pd.DataFrame(np.round(true_centers), columns = data.keys())
true_centers.index = segments
display(true_centers)
            Fresh    Milk     Grocery  Frozen   Detergents_Paper  Delicatessen
Segment 0   8812.0   2052.0   2689.0   2058.0   337.0             712.0
Segment 1   4316.0   6347.0   9555.0   1036.0   3046.0            945.0
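One caveat about this recovery step: undoing a two-dimensional PCA gives an approximation of the original six features, not an exact copy. The sketch below shows the round trip on synthetic spending data (our assumption), once with all components, where the inverse is exact, and once with two.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)

# Synthetic positive "spending" values for 200 customers and 6 categories
spending = np.exp(rng.normal(loc=8.0, scale=1.0, size=(200, 6)))
log_spending = np.log(spending)

# With all six components, the inverse transform recovers the data exactly
pca_full = PCA().fit(log_spending)
exact = np.exp(pca_full.inverse_transform(pca_full.transform(log_spending)))

# With two components, the recovered values are only an approximation
pca_2d = PCA(n_components=2).fit(log_spending)
approx = np.exp(pca_2d.inverse_transform(pca_2d.transform(log_spending)))

# Largest relative deviation introduced by dropping four components
max_rel_error = float(np.max(np.abs(approx - spending) / spending))
```

The recovered centers are therefore best read as representative spending levels rather than exact averages.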

An interesting question here: considering the total purchase cost of each product category for the representative data points above, and referencing the statistical description of the dataset at the beginning of this project, what set of establishments could each of the customer segments represent?

Taking an educated guess,

  • Segment 0: This segment best represents restaurants. Their spend on Fresh and Frozen is higher than the median, and their spend on Deli is lower than, but still close to, the median. Their spend on Milk, Grocery and Detergents_Paper is lower than the median, which adds to our assessment.

  • Segment 1: This segment best represents supermarkets. They spend a higher than median amount on Milk, Grocery and Detergents_Paper and a close to median amount on Deli, all of which are essential to be stocked in such places.

Let's find which cluster each sample point is predicted to belong to.

# Display the predictions
for i, pred in enumerate(sample_preds):
    print("Sample point", i, "predicted to be in Cluster", pred)
Sample point 0 predicted to be in Cluster 0
Sample point 1 predicted to be in Cluster 0
Sample point 2 predicted to be in Cluster 0

Our guesses for sample points 0, 1 and 2 were a restaurant, a supermarket and a cafe. The predictions look consistent with our guesses for sample points 0 and 2, but inconsistent with our guess for sample point 1. Looking at the cluster visualization in the previous section, it could be that sample 1 is the point close to the boundary between the two clusters.

Conclusion and Implications: How to use this knowledge?

In this final section, we will investigate ways that you can make use of the clustered data. First, we will consider how the different groups of customers, the customer segments, may be affected differently by a specific delivery scheme. Then, we will consider how giving a label to each customer (which segment that customer belongs to) can provide for additional features about the customer data.

Companies will often run A/B tests when making small changes to their products or services to determine whether making that change will affect its customers positively or negatively. The wholesale distributor is considering changing its delivery service from currently 5 days a week to 3 days a week. However, the distributor will only make this change in delivery service for customers that react positively.

How can the wholesale distributor use the customer segments to determine which customers, if any, would react positively to the change in delivery service?

Making the change to the delivery service means that products will be delivered fewer times in a week.

The wholesale distributor can use the clusters to design the A/B test, but the test should be run on one cluster at a time. The two clusters represent different types of customers, so their delivery needs, and therefore their reactions to the change, are likely to differ. In other words, the control and experiment groups for any given test should both come from the same cluster.
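As a rough sketch of how such a test could be set up, the snippet below splits the customers of each segment at random into a control half and an experiment half. The function name `assign_ab_groups`, the even split and the toy segment labels are our assumptions for illustration.

```python
import numpy as np

def assign_ab_groups(segments, seed=0):
    """Randomly split the customers of each segment into control and experiment halves."""
    segments = np.asarray(segments)
    rng = np.random.RandomState(seed)
    groups = np.empty(len(segments), dtype=object)
    for segment in np.unique(segments):
        members = np.flatnonzero(segments == segment)  # customers in this segment
        shuffled = rng.permutation(members)
        half = len(shuffled) // 2
        groups[shuffled[:half]] = "control"     # keeps 5-day delivery
        groups[shuffled[half:]] = "experiment"  # switched to 3-day delivery
    return groups

# Toy segment labels for ten customers (illustrative only)
segments = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
groups = assign_ab_groups(segments)
```

The reaction to the change can then be compared between the control and experiment groups within each segment separately.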

Additional structure is derived from originally unlabeled data when using clustering techniques. Since each customer has a customer segment it best identifies with (depending on the clustering algorithm applied), we can consider 'customer segment' as an engineered feature for the data. Assume the wholesale distributor recently acquired ten new customers and each provided estimates for anticipated annual spending of each product category. Knowing these estimates, the wholesale distributor wants to classify each new customer to a customer segment to determine the most appropriate delivery service.

How can the wholesale distributor label the new customers using only their estimated product spending and the customer segment data?

To label the new customers, the distributor first needs to build and train a supervised learner on the data that we labeled through clustering. The features will be the estimated spends, and the target variable will be the customer segment, 0 or 1 (one for each of the two establishment types identified above, supermarkets and restaurants). The trained classifier can then predict segments for new incoming data.
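A minimal sketch of that workflow, on synthetic data: cluster the existing customers, use the cluster labels as the target, train a classifier, and predict segments for ten new customers. The decision tree, the generated log-spending values and the use of `GaussianMixture` (the newer scikit-learn class) are all our assumptions here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Synthetic log-spending for 200 existing customers in two loose groups
existing = np.vstack([
    rng.normal(loc=7.0, scale=0.5, size=(100, 6)),
    rng.normal(loc=9.0, scale=0.5, size=(100, 6)),
])

# Step 1: derive segment labels by clustering the unlabeled data
clusterer = GaussianMixture(n_components=2, random_state=0).fit(existing)
segment_labels = clusterer.predict(existing)

# Step 2: train a supervised learner with the segment as the target variable
classifier = DecisionTreeClassifier(max_depth=3, random_state=0)
classifier.fit(existing, segment_labels)

# Step 3: predict the segment of ten new customers from their estimated spending
new_customers = rng.normal(loc=8.0, scale=1.0, size=(10, 6))
new_segments = classifier.predict(new_customers)
```

In the real project the new customers' estimates would first need the same log transform applied to the training data.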

© 2023 Sajal Sharma.