Additions
So far, we have analyzed the Online Retail dataset with K-Means clustering and grouped customers based on their purchasing behavior. The process required:
- cleaning the data,
- adding a feature ('SaleAmount') that describes purchasing behavior well,
- normalizing (log-transforming) and scaling the features,
- finding the optimal number of clusters (with the elbow method and silhouette score), and
- fitting and refitting the model.
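As a compact recap, the steps above could be sketched like this. The toy data, the random seed, and the final choice of k=4 are stand-ins here, not the actual values from the series:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the cleaned per-customer features from the series
rng = np.random.default_rng(42)
customer_df = pd.DataFrame({
    'Freq': rng.integers(1, 200, 300),
    'SaleAmount': rng.uniform(10.0, 5000.0, 300),
    'ElapsedDays': rng.integers(0, 365, 300),
})

# Log-transform to reduce skew (log1p keeps zero values defined)
for col in ['Freq', 'SaleAmount', 'ElapsedDays']:
    customer_df[f'{col}_log'] = np.log1p(customer_df[col])

# Scale the log features before clustering
X = StandardScaler().fit_transform(
    customer_df[['Freq_log', 'SaleAmount_log', 'ElapsedDays_log']])

# Compare candidate k values with inertia (elbow) and silhouette score
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1),
          round(silhouette_score(X, km.labels_), 3))

# Fit the final model with the chosen k and attach the labels
customer_df['ClusterLabel'] = KMeans(
    n_clusters=4, n_init=10, random_state=0).fit_predict(X)
```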
This kind of analysis is typically used for customer segmentation in marketing and for understanding customer behavior. Some of the actions we can take based on this segmentation are:
Interpret the Clusters
Visualize the Clusters
You can visualize the clusters in a 3D scatter plot using the three features ('Freq', 'SaleAmount', and 'ElapsedDays'). This can give you a more intuitive view of the customer groups.
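A minimal sketch of such a 3D plot, using hypothetical data in place of the series' saved `cluster_df` (the Agg backend and output filename are assumptions for running headlessly):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical cluster_df; in the series it comes from the saved CSV
rng = np.random.default_rng(0)
cluster_df = pd.DataFrame({
    'Freq': rng.integers(1, 200, 100),
    'SaleAmount': rng.uniform(10.0, 5000.0, 100),
    'ElapsedDays': rng.integers(0, 365, 100),
    'ClusterLabel': rng.integers(0, 4, 100),
})

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(projection='3d')
scatter = ax.scatter(cluster_df['Freq'], cluster_df['SaleAmount'],
                     cluster_df['ElapsedDays'],
                     c=cluster_df['ClusterLabel'], cmap='viridis')
ax.set_xlabel('Freq')
ax.set_ylabel('SaleAmount')
ax.set_zlabel('ElapsedDays')
ax.legend(*scatter.legend_elements(), title='Cluster')
fig.savefig('clusters_3d.png')
```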
Customer Profiling
You can further profile the customers within each cluster based on other available information (like demographics, preferred product categories, or preferred shopping times).
Develop Marketing Strategies
Predictive Modeling
The cluster labels can also serve as the target variable for a classification model: you can predict the cluster of new customers from the labels already assigned to existing customers.
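A sketch of that idea with hypothetical data; the synthetic labels here are made to depend on 'Freq' purely so the classifier has signal, and the random forest is my choice of model, not one from the series:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical labeled data; in practice use the saved customer segments
rng = np.random.default_rng(1)
cluster_df = pd.DataFrame({
    'Freq': rng.integers(1, 200, 400),
    'SaleAmount': rng.uniform(10.0, 5000.0, 400),
    'ElapsedDays': rng.integers(0, 365, 400),
})
# Synthetic label tied to 'Freq' so the toy problem is learnable
cluster_df['ClusterLabel'] = (cluster_df['Freq'] > 100).astype(int)

X = cluster_df[['Freq', 'SaleAmount', 'ElapsedDays']]
y = cluster_df['ClusterLabel']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(f'Test accuracy: {clf.score(X_test, y_test):.2f}')

# Assign a segment to a brand-new customer
new_customer = pd.DataFrame({'Freq': [5], 'SaleAmount': [120.0],
                             'ElapsedDays': [200]})
print('Predicted cluster:', clf.predict(new_customer)[0])
```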
Repeat the Analysis
Regularly repeat the clustering analysis (for example, every quarter or every year) to catch changes in customer behavior over time.
Experiment and Refine
Experiment with different numbers of clusters, other clustering algorithms, and different features to segment on.
I recommend that you:
- Add a small constant to the features (especially 'ElapsedDays' in this case) before the log transform, retrain the model, and see whether the result changes.
- Drop entries with outlier values or unusual behavior, or soften them (for example with a small offset) to lessen their impact on the result. The method ChatGPT recommended to me is the Z-score method; an example is included at the end of this post.
You can save the result and do some further interpretation and analysis. These are additional steps, but they might still be required when writing papers or in real-life scenarios.
# Drop columns before saving the result
customer_df = customer_df.drop(['Freq_log', 'SaleAmount_log', 'ElapsedDays_log',
                                'Freq_scaled', 'SaleAmount_scaled', 'ElapsedDays_scaled'],
                               axis=1)
customer_df.to_csv('./data/customer_segments.csv', index=False)
import pandas as pd

# Load the dataframe from the CSV file
cluster_df = pd.read_csv('./data/customer_segments.csv')
cluster_df.head()
Out:
| | CustomerID | Freq | SaleAmount | ElapsedDays | ClusterLabel |
|---|---|---|---|---|---|
| 0 | 12346 | 1 | 77183.60 | 325 | 2 |
| 1 | 12347 | 182 | 4310.00 | 1 | 1 |
| 2 | 12348 | 31 | 1797.24 | 74 | 2 |
| 3 | 12349 | 73 | 1757.55 | 18 | 2 |
| 4 | 12350 | 17 | 334.40 | 309 | 0 |
# Calculate average spent per transaction
cluster_df['AvgSpent'] = cluster_df['SaleAmount'] / cluster_df['Freq']
cluster_df.head()
Out:
| | CustomerID | Freq | SaleAmount | ElapsedDays | ClusterLabel | AvgSpent |
|---|---|---|---|---|---|---|
| 0 | 12346 | 1 | 77183.60 | 325 | 2 | 77183.600000 |
| 1 | 12347 | 182 | 4310.00 | 1 | 1 | 23.681319 |
| 2 | 12348 | 31 | 1797.24 | 74 | 2 | 57.975484 |
| 3 | 12349 | 73 | 1757.55 | 18 | 2 | 24.076027 |
| 4 | 12350 | 17 | 334.40 | 309 | 0 | 19.670588 |
cluster_df.groupby('ClusterLabel').mean()
Out:
| ClusterLabel | CustomerID | Freq | SaleAmount | ElapsedDays | AvgSpent |
|---|---|---|---|---|---|
| 0 | 15381.412393 | 15.213675 | 299.475934 | 183.400997 | 42.764799 |
| 1 | 15184.131227 | 286.616039 | 7283.267533 | 11.043742 | 101.034628 |
| 2 | 15233.974419 | 81.235659 | 1538.726421 | 89.472868 | 97.774559 |
| 3 | 15382.825822 | 37.336175 | 593.981023 | 18.369062 | 33.272688 |
That completes the K-Means clustering process. References for the Z-score method and RFM analysis are below, along with the script from the posts.
- Example implementation of Z-score outlier detection:
import numpy as np
from scipy import stats

# Define a threshold for outliers (rows with an absolute Z-score above this are flagged)
threshold = 3

# Calculate Z-scores for each feature
z_scores = stats.zscore(cluster_df[['Freq', 'SaleAmount', 'ElapsedDays', 'AvgSpent']])

# Get a boolean mask of outliers
outliers = np.abs(z_scores) > threshold

# Check the outlier rows
cluster_df[outliers.any(axis=1)].info()
- Example of RFM segmentation:
| ClusterLabel | Recency | Frequency | Monetary | RFM_Segment | RFM_Score |
|---|---|---|---|---|---|
| 0 | 2 | 1 | 1 | 211 | 4 |
| 1 | 5 | 5 | 5 | 555 | 15 |
| 2 | 3 | 3 | 3 | 333 | 9 |
| 3 | 4 | 2 | 2 | 422 | 8 |
| ClusterLabel | Customer Segment | Type of Customers | Characteristics | Recommended Marketing Strategy |
|---|---|---|---|---|
| 0 | At Risk | Infrequent, lower-spending customers | These customers have low frequency and monetary scores. It's been a while since their last purchase, they don't make purchases very often, and they tend to spend a smaller amount on each purchase. | Focus on re-engagement strategies to increase their purchase frequency and monetary value. |
| 1 | Champions | Frequent, high-spending customers | These customers score the highest in all RFM categories. They've made a purchase very recently, make purchases frequently, and spend a lot on each purchase. | Focus on loyalty programs and exclusive services to maintain their satisfaction. A referral program might also suit this group. |
| 2 | Potential Loyal Customers | Less frequent but high-spending customers | These customers have a high monetary value but lower frequency. Their recency score is also lower, meaning it's been a while since their last purchase. | You can grow this segment into loyal customers through relevant recommendations and loyalty programs. |
| 3 | New Customers | Recent but less frequent, lower-spending customers | These customers have a high recency score but lower frequency and monetary value. These are customers who have recently made their first purchases. | These customers can be turned into repeat customers. Understand their needs and preferences and provide targeted marketing. |
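RFM scores like those in the tables above are commonly derived with quintile binning. A minimal sketch using hypothetical data and pandas' `qcut` (the dataframe, seed, and column mapping here are assumptions, not the series' actual data):

```python
import numpy as np
import pandas as pd

# Hypothetical per-customer metrics (in the series these come from cluster_df)
rng = np.random.default_rng(7)
rfm = pd.DataFrame({
    'ElapsedDays': rng.integers(0, 365, 200),      # recency in days
    'Freq': rng.integers(1, 200, 200),             # frequency
    'SaleAmount': rng.uniform(10.0, 5000.0, 200),  # monetary
})

# Quintile scores 1-5; ranking first avoids duplicate bin edges.
# Recency is inverted: fewer days since the last purchase scores higher.
rfm['R'] = pd.qcut(rfm['ElapsedDays'].rank(method='first'), 5,
                   labels=[5, 4, 3, 2, 1]).astype(int)
rfm['F'] = pd.qcut(rfm['Freq'].rank(method='first'), 5,
                   labels=[1, 2, 3, 4, 5]).astype(int)
rfm['M'] = pd.qcut(rfm['SaleAmount'].rank(method='first'), 5,
                   labels=[1, 2, 3, 4, 5]).astype(int)

# Concatenated segment code and summed score, as in the tables above
rfm['RFM_Segment'] = (rfm['R'].astype(str) + rfm['F'].astype(str)
                      + rfm['M'].astype(str))
rfm['RFM_Score'] = rfm[['R', 'F', 'M']].sum(axis=1)
print(rfm[['RFM_Segment', 'RFM_Score']].head())
```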