From the ground up

In k-means clustering, a data point refers to an individual record in the dataset. In this case, a data point corresponds to a customer's shopping behavior. So we are grouping customers whose behaviors are similar and detecting underlying patterns.

To achieve this objective, we need to decide how many clusters we want. The number of clusters is represented by k. A cluster refers to a collection of data points aggregated together because of certain similarities.

K-means looks for a fixed number k of clusters in a dataset. K data points are chosen as the initial centers of the clusters and are known as "centroids". A centroid is the imaginary or real location representing the center of its cluster.

Here's a simplified version of how it works:

  1. Initialization: The algorithm starts by randomly choosing 'k' data points from the dataset to serve as the initial centroids.

  2. Assignment: For every other data point in the dataset, the algorithm calculates the distance to all 'k' centroids. It then assigns each data point to the cluster represented by the centroid it is closest to. The distance is typically calculated using Euclidean distance (a straight line distance between two points), although other distance measures can also be used.

  3. Update: After all data points have been assigned to a cluster, the positions of the 'k' centroids are updated. The new position of a centroid is calculated as the mean (hence 'k-means') of all the data points assigned to that cluster (i.e., it's moved to the center of the data points).

  4. Repeat: Steps 2 and 3 are repeated until the centroids no longer move significantly, or a set number of iterations has been reached. Essentially, the algorithm is trying to find when the clusters are relatively stable and the data points aren't switching around between clusters.
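
To make these steps concrete, here is a minimal from-scratch sketch of the loop using NumPy. It is purely illustrative (no handling of empty clusters or multiple restarts, and the toy array X is made up); for the real analysis we'll rely on scikit-learn's KMeans, which handles initialization and convergence far more robustly.

import numpy as np

def simple_kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k random data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assignment: distance from every point to every centroid, then pick the nearest
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Update: move each centroid to the mean of the points assigned to it
        #    (a full implementation would also handle clusters that end up empty)
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # 4. Repeat until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two obvious groups of 2-D points; the sketch should recover them
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.1], [8.0, 8.0], [8.2, 7.7], [7.9, 8.3]])
labels, centroids = simple_kmeans(X, k=2)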


Data preparation

The libraries we'll use handle data manipulation and visualization, and KMeans from scikit-learn is the specific estimator we'll use to perform the clustering. Let's start by importing the necessary libraries:

# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

Next, let's load the dataset (it takes a while):

# Load the dataset
retail_df = pd.read_excel('./data/Online_Retail.xlsx')

# Check basic information about the dataset
retail_df.info()

Out:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  float64       
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB

Next, we'll clean the data to ensure it's in the right format and to handle missing or unusual values.

In this data cleaning step, we are removing any rows that don't have a CustomerID, or where the quantity or price is zero or negative. These are unlikely to be meaningful for the clustering. We're also removing any duplicated records.

Check the data again after cleaning it to understand what each column represents and the distribution of values.

# Remove rows with a missing CustomerID, or with zero or negative quantities or prices
retail_df = retail_df[retail_df['Quantity'] > 0]
retail_df = retail_df[retail_df['UnitPrice'] > 0]
retail_df = retail_df[retail_df['CustomerID'].notnull()]

# Convert 'CustomerID' column to integer type
retail_df['CustomerID'] = retail_df['CustomerID'].astype(int)

# Check cleaned dataset information
retail_df.isnull().sum(), retail_df.shape

Out:

(InvoiceNo      0
 StockCode      0
 Description    0
 Quantity       0
 InvoiceDate    0
 UnitPrice      0
 CustomerID     0
 Country        0
 dtype: int64,
 (397884, 8))

# Remove any duplicate records
retail_df.drop_duplicates(inplace=True)
retail_df.shape # Print the shape of dataset to confirm duplicates removal

Out:

(392692, 8)

  • The 'CustomerID' column is converted to integer type to ensure it's handled correctly in the following calculations. Even though 'CustomerID' values look like numbers, they might be stored as floats or strings in the original dataset. Converting the column to integer ensures the IDs are treated as whole numbers, which is more suitable and memory-efficient for an ID field. If the column were already an integer, this operation would have no effect.

  • Then we count the total number of missing or null values in each column to quickly check if there is missing data in the DataFrame that might need to be handled.

  • Then we check the DataFrame's shape to see how many rows (data points) and columns (variables) remain.

Once the data is ready, we'll proceed with preparing the specific columns that we will use for the clustering. This includes creating the SaleAmount column and calculating aggregated metrics like frequency, total sales amount and recency of purchase for each customer.

# Add 'SaleAmount' column
retail_df['SaleAmount'] = retail_df['UnitPrice'] * retail_df['Quantity']

# Create a new dataframe with each customer's frequency, total sales amount, and recency of purchase
aggregations = {    
    'InvoiceNo':'count',
    'SaleAmount':'sum',
    'InvoiceDate':'max'
}
customer_df = retail_df.groupby('CustomerID').agg(aggregations)
customer_df = customer_df.reset_index()

# Rename columns
customer_df = customer_df.rename(columns = {'InvoiceNo':'Freq', 'InvoiceDate':'ElapsedDays'})

  • The 'SaleAmount' column is used in the aggregation to calculate the total sales amount for each customer. We group the rows by 'CustomerID'; for each group (each unique 'CustomerID'), we count InvoiceNo (the number of purchases), sum SaleAmount (the total sales amount), and take the maximum of InvoiceDate (the latest purchase date).

  • After the count aggregation is applied, the data in the InvoiceNo column represents the frequency of purchases for each customer, not the actual invoice numbers (hence the rename to Freq).

  • Similarly, after max is applied, the InvoiceDate column holds each customer's most recent purchase date. Once those dates are subtracted from a reference date in a later step (see the sketch below), the column will represent the number of days elapsed since the last purchase, not the actual invoice dates (hence the rename to ElapsedDays).

customer_df.head(), customer_df.shape

Out:

(   CustomerID  Freq  SaleAmount         ElapsedDays
 0       12346     1    77183.60 2011-01-18 10:01:00
 1       12347   182     4310.00 2011-12-07 15:52:00
 2       12348    31     1797.24 2011-09-25 13:13:00
 3       12349    73     1757.55 2011-11-21 09:51:00
 4       12350    17      334.40 2011-02-02 16:01:00,
 (4338, 4))
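
At this point the ElapsedDays column still holds raw purchase dates, as the output above shows. The conversion into an actual day count happens in a later step; a rough sketch of it might look like the following (the reference date used here, one day after the most recent invoice in the data, is an assumption chosen for illustration):

import datetime

# Assumed reference point: one day after the most recent purchase in the dataset
reference_date = customer_df['ElapsedDays'].max() + datetime.timedelta(days=1)

# Replace each customer's latest purchase date with the days elapsed since that purchase
customer_df['ElapsedDays'] = (reference_date - customer_df['ElapsedDays']).dt.days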

Then we'll scale and standardize these prepared features before we actually run the k-means algorithm.
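
As a preview, that step could be sketched with scikit-learn's StandardScaler, assuming Freq, SaleAmount, and ElapsedDays (as a day count) are the clustering features and that we log-transform them first to tame the heavy skew typical of sales data:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Log-transform the skewed features (assumes ElapsedDays has already been converted to days)
features = np.log1p(customer_df[['Freq', 'SaleAmount', 'ElapsedDays']])

# Standardize to zero mean and unit variance so no single feature dominates the distances
X_features = StandardScaler().fit_transform(features.values)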