Additions
So far, we have analyzed the Online Retail dataset with K-Means clustering and grouped customers based on their purchasing behavior. The process required:
- cleaning the data,
- adding a feature ('SaleAmount') that describes purchasing behavior well,
- normalizing (log-transforming) and scaling the features,
- finding the optimal number of clusters (with the elbow method and silhouette score), and
- fitting and refitting the model.
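As a compact recap, the steps above could be sketched like this. The toy data, the random seed, and the final choice of k=4 are stand-ins here, not the actual values from the series:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the cleaned per-customer features from the series
rng = np.random.default_rng(42)
customer_df = pd.DataFrame({
    'Freq': rng.integers(1, 200, 300),
    'SaleAmount': rng.uniform(10.0, 5000.0, 300),
    'ElapsedDays': rng.integers(0, 365, 300),
})

# Log-transform to reduce skew (log1p keeps zero values defined)
for col in ['Freq', 'SaleAmount', 'ElapsedDays']:
    customer_df[f'{col}_log'] = np.log1p(customer_df[col])

# Scale the log features before clustering
X = StandardScaler().fit_transform(
    customer_df[['Freq_log', 'SaleAmount_log', 'ElapsedDays_log']])

# Compare candidate k values with inertia (elbow) and silhouette score
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1),
          round(silhouette_score(X, km.labels_), 3))

# Fit the final model with the chosen k and attach the labels
customer_df['ClusterLabel'] = KMeans(
    n_clusters=4, n_init=10, random_state=0).fit_predict(X)
```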
This kind of analysis is typically used for customer segmentation in marketing and for understanding customer behavior. Some of the actions we can take based on this segmentation are:
Interpret the Clusters
Visualize the Clusters
You can visualize the clusters in a 3D scatter plot using the three features ('Freq', 'SaleAmount', and 'ElapsedDays'). This can give you a more intuitive view of the customer groups.
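A minimal sketch of such a 3D plot, using hypothetical data in place of the series' saved `cluster_df` (the Agg backend and output filename are assumptions for running headlessly):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical cluster_df; in the series it comes from the saved CSV
rng = np.random.default_rng(0)
cluster_df = pd.DataFrame({
    'Freq': rng.integers(1, 200, 100),
    'SaleAmount': rng.uniform(10.0, 5000.0, 100),
    'ElapsedDays': rng.integers(0, 365, 100),
    'ClusterLabel': rng.integers(0, 4, 100),
})

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(projection='3d')
scatter = ax.scatter(cluster_df['Freq'], cluster_df['SaleAmount'],
                     cluster_df['ElapsedDays'],
                     c=cluster_df['ClusterLabel'], cmap='viridis')
ax.set_xlabel('Freq')
ax.set_ylabel('SaleAmount')
ax.set_zlabel('ElapsedDays')
ax.legend(*scatter.legend_elements(), title='Cluster')
fig.savefig('clusters_3d.png')
```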
Customer Profiling
You can further profile the customers within each cluster based on other available information (like demographics, preferred product categories, or preferred shopping times).
Develop Marketing Strategies
Predictive Modeling
The cluster labels can also serve as the target variable for a classification model: you can predict the cluster of new customers from the labels already assigned to existing customers.
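A sketch of that idea with hypothetical data; the synthetic labels here are made to depend on 'Freq' purely so the classifier has signal, and the random forest is my choice of model, not one from the series:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical labeled data; in practice use the saved customer segments
rng = np.random.default_rng(1)
cluster_df = pd.DataFrame({
    'Freq': rng.integers(1, 200, 400),
    'SaleAmount': rng.uniform(10.0, 5000.0, 400),
    'ElapsedDays': rng.integers(0, 365, 400),
})
# Synthetic label tied to 'Freq' so the toy problem is learnable
cluster_df['ClusterLabel'] = (cluster_df['Freq'] > 100).astype(int)

X = cluster_df[['Freq', 'SaleAmount', 'ElapsedDays']]
y = cluster_df['ClusterLabel']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(f'Test accuracy: {clf.score(X_test, y_test):.2f}')

# Assign a segment to a brand-new customer
new_customer = pd.DataFrame({'Freq': [5], 'SaleAmount': [120.0],
                             'ElapsedDays': [200]})
print('Predicted cluster:', clf.predict(new_customer)[0])
```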
Repeat the Analysis
Regularly repeat the clustering analysis (for example, every quarter or every year) to catch changes in customer behavior over time.
Experiment and Refine
Experiment with different numbers of clusters, other clustering algorithms, and different features to segment on.
I recommend that you:
- Add a small constant to the features (especially 'ElapsedDays' in this case) before the log transform, retrain the model, and see whether the result changes.
- Drop entries with outlier values or unusual behavior, or soften them (for example with a small offset) to lessen their impact on the result. The method ChatGPT recommended to me is the Z-score method; an example is included at the end of this post.
You can save the result and do some further interpretation and analysis. These are additional steps, but they might still be required when writing papers or in real-life scenarios.
# Drop columns before saving the result
customer_df = customer_df.drop(['Freq_log', 'SaleAmount_log', 'ElapsedDays_log',
                                'Freq_scaled', 'SaleAmount_scaled', 'ElapsedDays_scaled'],
                               axis=1)
customer_df.to_csv('./data/customer_segments.csv', index=False)
import pandas as pd

# Load the dataframe from the CSV file
cluster_df = pd.read_csv('./data/customer_segments.csv')
cluster_df.head()
Out:
| | CustomerID | Freq | SaleAmount | ElapsedDays | ClusterLabel |
|---|---|---|---|---|---|
| 0 | 12346 | 1 | 77183.60 | 325 | 2 |
| 1 | 12347 | 182 | 4310.00 | 1 | 1 |
| 2 | 12348 | 31 | 1797.24 | 74 | 2 |
| 3 | 12349 | 73 | 1757.55 | 18 | 2 |
| 4 | 12350 | 17 | 334.40 | 309 | 0 |
# Calculate average spent per transaction
cluster_df['AvgSpent'] = cluster_df['SaleAmount'] / cluster_df['Freq']
cluster_df.head()
Out:
| | CustomerID | Freq | SaleAmount | ElapsedDays | ClusterLabel | AvgSpent |
|---|---|---|---|---|---|---|
| 0 | 12346 | 1 | 77183.60 | 325 | 2 | 77183.600000 |
| 1 | 12347 | 182 | 4310.00 | 1 | 1 | 23.681319 |
| 2 | 12348 | 31 | 1797.24 | 74 | 2 | 57.975484 |
| 3 | 12349 | 73 | 1757.55 | 18 | 2 | 24.076027 |
| 4 | 12350 | 17 | 334.40 | 309 | 0 | 19.670588 |
cluster_df.groupby('ClusterLabel').mean()
Out:
| ClusterLabel | CustomerID | Freq | SaleAmount | ElapsedDays | AvgSpent |
|---|---|---|---|---|---|
| 0 | 15381.412393 | 15.213675 | 299.475934 | 183.400997 | 42.764799 |
| 1 | 15184.131227 | 286.616039 | 7283.267533 | 11.043742 | 101.034628 |
| 2 | 15233.974419 | 81.235659 | 1538.726421 | 89.472868 | 97.774559 |
| 3 | 15382.825822 | 37.336175 | 593.981023 | 18.369062 | 33.272688 |
That completes the K-Means clustering process. References for the Z-score method and RFM analysis are below, along with the script from the posts.
- Example implementation of Z-score outlier detection:
import numpy as np
from scipy import stats

# Define a threshold for outliers (rows with an absolute Z-score above this are flagged)
threshold = 3

# Calculate Z-scores for each feature
z_scores = stats.zscore(cluster_df[['Freq', 'SaleAmount', 'ElapsedDays', 'AvgSpent']])

# Get a boolean mask of outliers
outliers = np.abs(z_scores) > threshold

# Check the outlier rows
cluster_df[outliers.any(axis=1)].info()
- Example of RFM segmentation:
| ClusterLabel | Recency | Frequency | Monetary | RFM_Segment | RFM_Score |
|---|---|---|---|---|---|
| 0 | 2 | 1 | 1 | 211 | 4 |
| 1 | 5 | 5 | 5 | 555 | 15 |
| 2 | 3 | 3 | 3 | 333 | 9 |
| 3 | 4 | 2 | 2 | 422 | 8 |
| ClusterLabel | Customer Segment | Type of Customers | Characteristics | Recommended Marketing Strategy |
|---|---|---|---|---|
| 0 | At Risk | Infrequent, lower-spending customers | These customers have low frequency and monetary scores. It's been a while since their last purchase, they don't make purchases very often, and they tend to spend a smaller amount on each purchase. | Focus on re-engagement strategies to increase their purchase frequency and monetary value. |
| 1 | Champions | Frequent, high-spending customers | These customers score the highest in all RFM categories. They've made a purchase very recently, make purchases frequently, and spend a lot on each purchase. | Focus on loyalty programs and exclusive services to maintain their satisfaction. A referral program might also suit this group. |
| 2 | Potential Loyal Customers | Less frequent but high-spending customers | These customers have a high monetary value but lower frequency. Their recency score is also lower, meaning it's been a while since their last purchase. | You can grow this segment into loyal customers through relevant recommendations and loyalty programs. |
| 3 | New Customers | Recent but less frequent, lower-spending customers | These customers have a high recency score but lower frequency and monetary value. These are customers who have recently made their first purchases. | These customers can be turned into repeat customers. Understand their needs and preferences and provide targeted marketing. |
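RFM scores like those in the tables above are commonly derived with quintile binning. A minimal sketch using hypothetical data and pandas' `qcut` (the dataframe, seed, and column mapping here are assumptions, not the series' actual data):

```python
import numpy as np
import pandas as pd

# Hypothetical per-customer metrics (in the series these come from cluster_df)
rng = np.random.default_rng(7)
rfm = pd.DataFrame({
    'ElapsedDays': rng.integers(0, 365, 200),      # recency in days
    'Freq': rng.integers(1, 200, 200),             # frequency
    'SaleAmount': rng.uniform(10.0, 5000.0, 200),  # monetary
})

# Quintile scores 1-5; ranking first avoids duplicate bin edges.
# Recency is inverted: fewer days since the last purchase scores higher.
rfm['R'] = pd.qcut(rfm['ElapsedDays'].rank(method='first'), 5,
                   labels=[5, 4, 3, 2, 1]).astype(int)
rfm['F'] = pd.qcut(rfm['Freq'].rank(method='first'), 5,
                   labels=[1, 2, 3, 4, 5]).astype(int)
rfm['M'] = pd.qcut(rfm['SaleAmount'].rank(method='first'), 5,
                   labels=[1, 2, 3, 4, 5]).astype(int)

# Concatenated segment code and summed score, as in the tables above
rfm['RFM_Segment'] = (rfm['R'].astype(str) + rfm['F'].astype(str)
                      + rfm['M'].astype(str))
rfm['RFM_Score'] = rfm[['R', 'F', 'M']].sum(axis=1)
print(rfm[['RFM_Segment', 'RFM_Score']].head())
```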