In k-means clustering, a data point refers to an individual record in the dataset. In this case, a data point corresponds to a customer's shopping behavior. So we are grouping customers whose behaviors are similar and detecting underlying patterns.
To achieve this objective, we need to decide how many clusters we want. The number of clusters is represented by k. A cluster refers to a collection of data points aggregated together because of certain similarities.
K-means looks for a fixed number k of clusters in a dataset. Initially, k data points are chosen as the centers of the clusters; these are known as "centroids". A centroid is the imaginary or real location representing the center of a cluster.
Here's a simplified version of how it works:
Initialization: The algorithm starts by randomly choosing 'k' data points from the dataset to serve as the initial centroids.
Assignment: For every other data point in the dataset, the algorithm calculates the distance to all 'k' centroids. It then assigns each data point to the cluster represented by the centroid it is closest to. The distance is typically calculated using Euclidean distance (a straight line distance between two points), although other distance measures can also be used.
Update: After all data points have been assigned to a cluster, the positions of the 'k' centroids are updated. The new position of a centroid is calculated as the mean (hence 'k-means') of all the data points assigned to that cluster (i.e., it's moved to the center of the data points).
Repeat: Steps 2 and 3 are repeated until the centroids no longer move significantly, or a set number of iterations has been reached. Essentially, the algorithm is trying to find when the clusters are relatively stable and the data points aren't switching around between clusters.
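The four steps above can be sketched as a short NumPy implementation. This is a minimal illustration of the algorithm, not a production implementation; the function and parameter names (`kmeans`, `max_iter`, `tol`) are illustrative, not from any library:

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k random data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Assignment: Euclidean distance from every point to every centroid,
        # then assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: move each centroid to the mean of its assigned points
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster goes empty
                new_centroids[j] = members.mean(axis=0)
        # Repeat: stop once the centroids no longer move significantly
        if np.linalg.norm(new_centroids - centroids) < tol:
            return labels, new_centroids
        centroids = new_centroids
    return labels, centroids
```

In practice we will use scikit-learn's `KMeans`, which implements the same loop with smarter initialization, but the sketch shows exactly what the assignment and update steps do.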
Data preparation
The libraries we'll use are for data manipulation and visualization, and KMeans is the specific function we'll use to perform the clustering. Let's start with importing the necessary libraries:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
Next, let's load the dataset (it takes a while):
# Load the dataset
retail_df = pd.read_excel('./data/Online_Retail.xlsx')
# Check basic information about the dataset
retail_df.info()
Out:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 InvoiceNo 541909 non-null object
1 StockCode 541909 non-null object
2 Description 540455 non-null object
3 Quantity 541909 non-null int64
4 InvoiceDate 541909 non-null datetime64[ns]
5 UnitPrice 541909 non-null float64
6 CustomerID 406829 non-null float64
7 Country 541909 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB
Next, we'll clean the data to ensure our data is in the right format, and we'll need to handle missing or unusual values.
In this data cleaning step, we are removing any rows that don't have a CustomerID, or where the quantity or price is zero or negative. These are unlikely to be meaningful for the clustering. We're also removing any duplicated records.
Check the data again after cleaning it to understand what each column represents and the distribution of values.
# Remove rows with missing CustomerID, or zero/negative quantities or prices
retail_df = retail_df[retail_df['Quantity'] > 0]
retail_df = retail_df[retail_df['UnitPrice'] > 0]
retail_df = retail_df[retail_df['CustomerID'].notnull()]
# Convert 'CustomerID' column to integer type
retail_df['CustomerID'] = retail_df['CustomerID'].astype(int)
# Check cleaned dataset information
retail_df.isnull().sum(), retail_df.shape
Out:
(InvoiceNo 0
StockCode 0
Description 0
Quantity 0
InvoiceDate 0
UnitPrice 0
CustomerID 0
Country 0
dtype: int64,
(397884, 8))
# Remove any duplicate records
retail_df.drop_duplicates(inplace=True)
retail_df.shape # Print the shape of dataset to confirm duplicates removal
Out:
(392692, 8)
The 'CustomerID' is converted to integer type to ensure that it's handled correctly in the following calculations. Even though 'CustomerID' fields seem to be numbers, they might be stored as floats or strings in the original dataset. By converting the 'CustomerID' column to integer, we ensure that they are handled as whole numbers, which is more suitable and memory efficient for an ID field. If the column was already an integer, this operation would have no effect.
We then count the missing (null) values in each column to quickly verify that no missing data remains, and check the DataFrame's shape to see how many rows (data points) and columns (variables) it contains.
Once the data is ready, we'll proceed with preparing the specific columns that we will use for the clustering. This includes creating the SaleAmount column and calculating aggregated metrics like frequency, total sales amount and recency of purchase for each customer.
# Add 'SaleAmount' column
retail_df['SaleAmount'] = retail_df['UnitPrice'] * retail_df['Quantity']
# Create a new dataframe with each customer's frequency, total sales amount, and recency of purchase
aggregations = {
'InvoiceNo':'count',
'SaleAmount':'sum',
'InvoiceDate':'max'
}
customer_df = retail_df.groupby('CustomerID').agg(aggregations)
customer_df = customer_df.reset_index()
# Rename columns
customer_df = customer_df.rename(columns = {'InvoiceNo':'Freq', 'InvoiceDate':'ElapsedDays'})
The 'SaleAmount' column is used in the aggregation to calculate the total sales amount for each customer. We group the records by 'CustomerID'; for each group (each unique 'CustomerID'), we count the InvoiceNo values (the number of purchases), sum the SaleAmount values (the total sales amount), and take the maximum InvoiceDate (the latest purchase date). After the count aggregation is applied, the InvoiceNo column represents the frequency of purchases for each customer, not actual invoice numbers. Similarly, once max has been applied and each customer's latest purchase date is subtracted from a recent reference date, the renamed 'ElapsedDays' column will represent the number of days elapsed since the last purchase for each customer, not the actual invoice dates.
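That date-to-days subtraction is not shown in the snippet above; here is a minimal sketch of the step on a toy two-row stand-in for customer_df. The reference date used here (one day after the latest purchase in the data) is an assumption, as is the toy data:

```python
import pandas as pd

# Tiny stand-in for customer_df (the real column comes from the aggregation above)
customer_df = pd.DataFrame({
    'CustomerID': [12346, 12347],
    'ElapsedDays': pd.to_datetime(['2011-01-18 10:01', '2011-12-07 15:52']),
})

# Assumed reference date: one day after the most recent purchase in the data
ref_date = customer_df['ElapsedDays'].max() + pd.Timedelta(days=1)

# Replace each customer's last-purchase timestamp with whole days elapsed since then
customer_df['ElapsedDays'] = (ref_date - customer_df['ElapsedDays']).dt.days
```

After this step, a customer who bought on the most recent day has an ElapsedDays close to 1, while a customer who has not purchased in months has a large value.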
customer_df.head(), customer_df.shape
Out:
( CustomerID Freq SaleAmount ElapsedDays
0 12346 1 77183.60 2011-01-18 10:01:00
1 12347 182 4310.00 2011-12-07 15:52:00
2 12348 31 1797.24 2011-09-25 13:13:00
3 12349 73 1757.55 2011-11-21 09:51:00
4 12350 17 334.40 2011-02-02 16:01:00,
(4338, 4))
Finally, we standardize the prepared features before actually running the k-means algorithm, so that variables measured on very different scales contribute equally to the distance calculations.
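As a sketch of that scaling step, assuming scikit-learn's StandardScaler and a small stand-in array for the prepared features (the sample values are illustrative, not from the dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in rows of (Freq, SaleAmount, ElapsedDays) for three customers
X = np.array([
    [1,   77183.60, 324],
    [182,  4310.00,   1],
    [31,   1797.24,  73],
], dtype=float)

# Standardize each feature to zero mean and unit variance so that a
# large-valued column like SaleAmount doesn't dominate the Euclidean distances
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```

Without this step, SaleAmount (in the thousands) would dwarf Freq and ElapsedDays in the distance computation, and the clusters would effectively be determined by spending alone.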