Customer Segmentation Project Using K-Means Clustering

Jupyter notebook available here

Table of Contents

  1. Overview
  2. Dataset Description
  3. Objective
  4. Theoretical Background
  5. Methodology
  6. Key Results
  7. Customer Segmentation Strategies
  8. Conclusion
  9. References

Overview

This project utilizes K-Means clustering, an unsupervised machine learning algorithm, to segment customers based on their purchasing behavior. By understanding customer profiles, businesses can tailor their strategies to boost customer satisfaction, loyalty, and revenue.

Dataset Description

The dataset is available here. The Online Retail II dataset contains all transactions recorded by a UK-based and registered non-store online retailer between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion giftware. Many of its customers are wholesalers.

Feature descriptions:

| Feature | Description |
| --- | --- |
| InvoiceNo | Unique 6-digit identifier for each transaction. Cancellation codes start with ‘C’. |
| StockCode | Unique code assigned to each product. |
| Description | Name of the product. |
| Quantity | Number of units purchased in a transaction. |
| InvoiceDate | Date and time of the transaction. |
| UnitPrice | Price per unit in GBP (£). |
| CustomerID | Unique identifier for each customer. |
| Country | Country where the customer resides. |

Objective

The primary goals of this project are:


Theoretical Background

Why K-Means Clustering?

K-Means is chosen for this project because:

  1. Efficiency: K-Means works well with large datasets and provides quick clustering results.
  2. Scalability: The cost of each iteration grows linearly with the number of data points, clusters, and features, so it stays practical as the data grows.
  3. Interpretability: The results are easy to visualize and interpret, especially for customer segmentation tasks.
  4. Versatility: It works well with numerical data, which is predominant in this dataset.

Understanding the Algorithm

Steps in K-Means Clustering:

  1. Initialization:
    • Choose the number of clusters (‘k’).
    • Randomly initialize cluster centroids.
  2. Assignment Step:
    • Assign each data point to the nearest centroid using the Euclidean distance.
  3. Update Step:
    • Recalculate the centroids by taking the mean of all points assigned to a cluster.
  4. Convergence:
    • Repeat steps 2 and 3 until the centroids no longer change significantly or a predefined number of iterations is reached.
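
The four steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's code: the function name `kmeans`, its parameters, and the toy points are assumptions, and sampling existing data points is just one way to do the random initialization.

```python
import math
import random


def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Hypothetical helper: Lloyd's algorithm on a list of numeric tuples."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids (an assumed choice).
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Step 4: stop once no centroid moved more than tol.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, labels


# Toy data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

On this toy data the first three points end up in one cluster and the last three in the other. In practice a library implementation is usually preferable to a hand-written loop.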

Mathematical Objective:

K-Means minimizes the Within-Cluster Sum of Squares (WCSS):

\[WCSS = \sum_{i=1}^{k} \sum_{x \in C_i} \| x - \mu_i \|^2\]

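Where:

  • $k$ is the number of clusters.
  • $C_i$ is the set of data points assigned to cluster $i$.
  • $\mu_i$ is the centroid (mean) of cluster $i$.
  • $\| x - \mu_i \|^2$ is the squared Euclidean distance between point $x$ and centroid $\mu_i$.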
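
As a small worked check of the WCSS objective above, the sketch below evaluates it on two hand-made clusters. The function name `wcss` and the sample points are illustrative assumptions.

```python
def wcss(clusters):
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    total = 0.0
    for points in clusters:
        # Centroid mu_i: coordinate-wise mean of the points in cluster C_i.
        mu = [sum(dim) / len(points) for dim in zip(*points)]
        # Add ||x - mu_i||^2 for every x in C_i.
        total += sum(
            sum((x_d - mu_d) ** 2 for x_d, mu_d in zip(x, mu)) for x in points
        )
    return total


clusters = [
    [(0.0, 0.0), (2.0, 0.0)],  # centroid (1, 0): contributes 1 + 1
    [(5.0, 5.0), (5.0, 9.0)],  # centroid (5, 7): contributes 4 + 4
]
print(wcss(clusters))  # 10.0
```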

Silhouette Score:

The Silhouette Score evaluates the quality of clustering by comparing intra-cluster and inter-cluster distances. It ranges from $-1$ to $1$, and values near 1 indicate compact, well-separated clusters. For a single data point $i$:

\[S(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}\]

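Where:

  • $a(i)$ is the mean distance from point $i$ to the other points in its own cluster.
  • $b(i)$ is the mean distance from point $i$ to the points in the nearest neighboring cluster.

The overall Silhouette Score is the mean of $S(i)$ over all data points.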
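
The Silhouette formula above can be evaluated by hand on a tiny one-dimensional example. The helper name `silhouette_samples` and the data are assumptions for illustration. The sketch needs at least two clusters and scores a point that is alone in its cluster as 0, which is a common convention.

```python
import math


def silhouette_samples(points, labels):
    """Return S(i) for every point; requires at least two clusters."""
    scores = []
    for i, p in enumerate(points):
        # Sum of distances from p to the points of each cluster (excluding p).
        sums, counts = {}, {}
        for j, q in enumerate(points):
            if i == j:
                continue
            sums[labels[j]] = sums.get(labels[j], 0.0) + math.dist(p, q)
            counts[labels[j]] = counts.get(labels[j], 0) + 1
        own = labels[i]
        if own not in counts:
            # Convention: a point alone in its cluster scores 0.
            scores.append(0.0)
            continue
        a = sums[own] / counts[own]                               # a(i)
        b = min(sums[c] / counts[c] for c in counts if c != own)  # b(i)
        scores.append((b - a) / max(a, b))
    return scores


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = [0, 0, 1, 1]
scores = silhouette_samples(points, labels)
print(sum(scores) / len(scores))
```

For the first point, a = 1 and b = 10.5, so S = 9.5 / 10.5, roughly 0.90. Scores that close to 1 are what tight, well-separated clusters produce.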


Methodology

Exploratory Data Analysis (EDA)

EDA involved:

  1. Checking for null values and invalid data (e.g., negative prices).
  2. Analyzing the distribution of numerical features like Quantity, UnitPrice, and CustomerID.
  3. Identifying patterns in InvoiceDate and Country to uncover insights into purchasing trends.

Observations:

Data Cleaning

  1. Removing Invalid Entries:
    • Dropped rows with missing CustomerID.
    • Excluded transactions with a zero or negative Quantity or UnitPrice.
  2. Processing Categorical Variables:
    • Cleaned StockCode by removing non-product entries like “ADJUST” and “TEST”.
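
A minimal pandas sketch of these cleaning steps follows. The sample rows are made up, and matching non-product codes by the prefixes "ADJUST" and "TEST" is an assumption about how such entries look, not the notebook's exact rule.

```python
import pandas as pd

# Tiny made-up sample using the column names described in this README.
df = pd.DataFrame({
    "InvoiceNo": ["489434", "489435", "C489436", "489437", "489438"],
    "StockCode": ["85048", "TEST001", "22041", "ADJUST", "21232"],
    "Quantity": [12, 5, -1, 1, 24],
    "UnitPrice": [6.95, 4.50, 2.10, 0.00, 1.25],
    "CustomerID": [13085.0, 13085.0, 13078.0, None, 15362.0],
})

# Drop rows with no customer attached.
clean = df.dropna(subset=["CustomerID"])
# Keep only rows with a positive quantity and a positive price.
clean = clean[(clean["Quantity"] > 0) & (clean["UnitPrice"] > 0)]
# Remove non-product stock codes (assumed prefixes, for illustration only).
non_product = clean["StockCode"].str.upper().str.contains(r"^(?:ADJUST|TEST)")
clean = clean[~non_product]

print(clean)
```

Only the first and last sample rows survive: the others fail the customer, quantity, price, or stock-code checks.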

Feature Engineering

  1. Created New Features:
    • sales_line_total = Quantity × UnitPrice: Revenue for each invoice line.
  2. Aggregated Data:
    • Grouped by CustomerID to calculate:
      • Monetary Value: Total spending.
      • Frequency: Number of unique transactions.
      • Recency: Days since the last purchase.
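
The aggregation above might look like the following pandas sketch. The data is made up, the output column names are illustrative, and measuring recency from the latest date in the data is an assumption (the README does not state the reference date).

```python
import pandas as pd

# Made-up cleaned transactions using the column names from this README.
df = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1", "B1"],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-05", "2011-01-05", "2011-03-01", "2011-02-10", "2011-02-10"]
    ),
    "Quantity": [2, 1, 4, 10, 5],
    "UnitPrice": [3.0, 4.0, 2.5, 1.0, 2.0],
})

# Revenue for each invoice line.
df["sales_line_total"] = df["Quantity"] * df["UnitPrice"]

# Assumption: recency is measured from the latest date in the data.
reference_date = df["InvoiceDate"].max()

rfm = df.groupby("CustomerID").agg(
    MonetaryValue=("sales_line_total", "sum"),   # total spending
    Frequency=("InvoiceNo", "nunique"),          # number of unique invoices
    LastPurchase=("InvoiceDate", "max"),
)
rfm["Recency"] = (reference_date - rfm["LastPurchase"]).dt.days
rfm = rfm.drop(columns="LastPurchase")

print(rfm)
```

Customer 1 has two invoices and a recency of 0 days; customer 2 has one invoice and last purchased 19 days before the reference date.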

Data Scaling and Preprocessing

Where :

Clustering Process

  1. Optimal K Selection:
    • Used the Elbow Method and Silhouette Scores to determine $k$ = 4.
  2. K-Means Execution:
    • Clustered scaled data into four segments.

[Figure: Elbow Method Graph]
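
One way to run this selection with scikit-learn is sketched below on synthetic data. The four generated blobs stand in for the real customer features, and the use of `StandardScaler`, the range of k values, and `n_init=10` are assumptions rather than the notebook's settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer features: four blobs in three dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0], [10, 0, 0], [0, 10, 0], [0, 0, 10]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 3)) for c in centers])

# Standardize so each feature has zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

inertia, silhouette = {}, {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertia[k] = model.inertia_  # WCSS, the quantity plotted in an elbow graph
    silhouette[k] = silhouette_score(X_scaled, model.labels_)

# Pick the k with the highest mean Silhouette Score.
best_k = max(silhouette, key=silhouette.get)
print(best_k)
```

Plotting `inertia` against k gives the elbow curve, and the Silhouette Scores offer a second opinion when the elbow is ambiguous.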

Key Results

Cluster Analysis

| Cluster | Characteristics | Description |
| --- | --- | --- |
| 0 | High-value, frequent buyers | Regular buyers with high spending. |
| 1 | Infrequent, low-value customers | Customers who purchase sporadically. |
| 2 | New or low-engagement customers | Customers with low spending but recent activity. |
| 3 | Loyal, high-frequency, high-value buyers | The most valuable customers in terms of revenue and engagement. |
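
Descriptions like these typically come from comparing average feature values across clusters. The sketch below shows one way to build such a profile with pandas. The numbers are invented for illustration and are not the project's results, and the column names are assumptions.

```python
import pandas as pd

# Made-up per-customer features with an already-assigned cluster label.
rfm = pd.DataFrame({
    "MonetaryValue": [5200.0, 4800.0, 150.0, 90.0, 300.0, 12000.0],
    "Frequency": [20, 18, 1, 2, 3, 45],
    "Recency": [10, 14, 300, 250, 7, 3],
    "Cluster": [0, 0, 1, 1, 2, 3],
})

# Mean of each feature per cluster, plus the number of customers in it.
profile = rfm.groupby("Cluster").agg(
    MonetaryValue=("MonetaryValue", "mean"),
    Frequency=("Frequency", "mean"),
    Recency=("Recency", "mean"),
    Customers=("Recency", "size"),
)

print(profile)
```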

Customer Segmentation Strategies

Cluster 0: “Retain”

Cluster 1: “Re-Engage”

Cluster 2: “Nurture”

Cluster 3: “Reward”


Conclusion

K-Means clustering successfully segmented customers into actionable groups. These insights can drive personalized marketing and improve overall business strategy.

Next Steps

  1. Deploy the clustering model in a production environment.
  2. Integrate segmentation results into CRM systems.
  3. Experiment with alternative clustering algorithms (e.g., Hierarchical Clustering).

References