Additions


So far, we have analyzed the Online Retail dataset with K-Means clustering and grouped customers by their purchasing behavior. The process required:

  • cleaning the data,

  • adding a feature ('SaleAmount') that describes purchasing behavior well,

  • normalization and scaling,

  • finding the optimal number of clusters (with the elbow method and the silhouette score), and

  • fitting and refitting the model.

This kind of analysis is typically used for customer segmentation in marketing and for understanding customer behavior. Some actions we can take based on this segmentation are:

  1. Interpret the Clusters

  2. Visualize the Clusters

    You can visualize the clusters in a 3D scatter plot using the three features ('Frequency', 'SaleAmount', and 'ElapsedDays'). This might give you a more intuitive understanding of the customer groups.

  3. Customer Profiling

    You can further profile the customers within each cluster based on other available information (like demographics, preferred product categories, or preferred shopping times).

  4. Develop Marketing Strategies

  5. Predictive Modeling

    The cluster labels can also serve as the target variable for a classification model, so you can predict which cluster a new customer belongs to based on the existing labels.

  6. Repeat the Analysis

    Regularly repeat the clustering analysis (for example, every quarter or every year) to catch changes in customer behavior over time.

  7. Experiment and Refine

    Experiment with different numbers of clusters, other clustering algorithms, and different features to segment on.

    I recommend that you:

    • Add a small constant to a feature (especially 'ElapsedDays' in this case), train the model again, and see whether the result changes.

    • Drop entries with outlier values or unusual behavior, or adjust them (for example, with a small offset) to lessen the impact of outliers on the result. ChatGPT recommended the Z-score method to me, but I'm not covering it in this series.
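
The first suggestion, adding a small constant before the log transform, can be sketched as follows. This is only an illustration: it assumes the `_log` columns were produced with a natural log, the offsets are arbitrary, and the helper `log_with_offset` is a hypothetical name. The `ElapsedDays` values are the first five from the dataset plus a made-up 0 to show why an offset matters.

```python
import numpy as np

# ElapsedDays of the first five customers, plus a hypothetical 0
# (a purchase on the reference date), where log(0) would be -inf.
elapsed_days = np.array([325, 1, 74, 18, 309, 0], dtype=float)

def log_with_offset(values, offset):
    """Natural log after shifting by `offset`; any offset > 0 keeps log(0) finite."""
    return np.log(values + offset)

# Illustrative offsets, not tuned values.
for offset in (1.0, 0.5, 0.01):
    transformed = log_with_offset(elapsed_days, offset)
    print(f"offset={offset}: min={transformed.min():.3f}, max={transformed.max():.3f}")
```

A smaller offset stretches the low end of the scale, so it is worth rescaling, refitting, and comparing the resulting clusters before settling on one.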


You can save the result and carry out some further interpretation and analysis. These steps are optional but may still be required when writing papers or in real-life scenarios.

# Drop columns before saving the result
customer_df = customer_df.drop(['Freq_log', 'SaleAmount_log', 'ElapsedDays_log', 
                                'Freq_scaled', 'SaleAmount_scaled', 'ElapsedDays_scaled'], axis=1)

customer_df.to_csv('./data/customer_segments.csv', index=False)

# Load the dataframe from a CSV file
cluster_df = pd.read_csv('./data/customer_segments.csv')

cluster_df.head()

Out:

   CustomerID  Freq  SaleAmount  ElapsedDays  ClusterLabel
0       12346     1    77183.60          325             2
1       12347   182     4310.00            1             1
2       12348    31     1797.24           74             2
3       12349    73     1757.55           18             2
4       12350    17      334.40          309             0

# Calculate average spent per transaction
cluster_df['AvgSpent'] = cluster_df['SaleAmount'] / cluster_df['Freq']

cluster_df.head()

Out:

   CustomerID  Freq  SaleAmount  ElapsedDays  ClusterLabel      AvgSpent
0       12346     1    77183.60          325             2  77183.600000
1       12347   182     4310.00            1             1     23.681319
2       12348    31     1797.24           74             2     57.975484
3       12349    73     1757.55           18             2     24.076027
4       12350    17      334.40          309             0     19.670588

cluster_df.groupby('ClusterLabel').mean()

Out:

                CustomerID        Freq   SaleAmount  ElapsedDays    AvgSpent
ClusterLabel
0             15381.412393   15.213675   299.475934   183.400997   42.764799
1             15184.131227  286.616039  7283.267533    11.043742  101.034628
2             15233.974419   81.235659  1538.726421    89.472868   97.774559
3             15382.825822   37.336175   593.981023    18.369062   33.272688
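
The averaged `CustomerID` in the output above is not meaningful, since it is an identifier rather than a measurement. A minimal sketch of a tidier summary follows; it restricts the means to the behavioral columns and adds the number of customers per cluster. The toy frame `sample_df` is built from the five rows shown earlier, and the `Customers` column name is my own choice.

```python
import pandas as pd

# Toy frame built from the five rows shown in the head() output.
sample_df = pd.DataFrame({
    'CustomerID': [12346, 12347, 12348, 12349, 12350],
    'Freq': [1, 182, 31, 73, 17],
    'SaleAmount': [77183.60, 4310.00, 1797.24, 1757.55, 334.40],
    'ElapsedDays': [325, 1, 74, 18, 309],
    'ClusterLabel': [2, 1, 2, 2, 0],
})
sample_df['AvgSpent'] = sample_df['SaleAmount'] / sample_df['Freq']

behavior_cols = ['Freq', 'SaleAmount', 'ElapsedDays', 'AvgSpent']

# Per-cluster means of the behavioral columns only (CustomerID is left out).
cluster_summary = sample_df.groupby('ClusterLabel')[behavior_cols].mean()

# Cluster size, to see how many customers each average represents.
cluster_summary['Customers'] = sample_df.groupby('ClusterLabel').size()

print(cluster_summary)
```

On the full `cluster_df`, the same two `groupby` calls should apply unchanged; the cluster sizes help judge how much weight each average deserves.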

That completes the K-Means clustering process. Below are references for the Z-score method and RFM analysis, along with the script from the posts.

  • Example implementation of Z-score outlier detection:
import numpy as np
from scipy import stats

# Define a threshold for outliers
threshold = 3  # rows with an absolute Z-score above this value are flagged

# Calculate Z-scores
z_scores = stats.zscore(cluster_df[['Freq', 'SaleAmount', 'ElapsedDays', 'AvgSpent']])

# Get boolean mask of outliers
outliers = np.abs(z_scores) > threshold

# Check the outlier rows
cluster_df[outliers.any(axis=1)].info()
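
To go one step further and actually drop the flagged rows, something like the following could work. It is a sketch on hypothetical toy data (`toy_df`, 19 ordinary customers and one extreme one), and it computes the Z-score directly in pandas with `ddof=0`, which matches the default of `scipy.stats.zscore`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 19 ordinary customers and one extreme spender.
toy_df = pd.DataFrame({
    'Freq': [10] * 19 + [12],
    'SaleAmount': [100.0 + i for i in range(19)] + [50000.0],
})

threshold = 3

# Population Z-score per column (ddof=0).
z_scores = (toy_df - toy_df.mean()) / toy_df.std(ddof=0)

# A row is an outlier if any of its features exceeds the threshold.
is_outlier = (np.abs(z_scores) > threshold).any(axis=1)

# Keep only the rows that are not flagged.
filtered_df = toy_df[~is_outlier]

print(f"dropped {is_outlier.sum()} of {len(toy_df)} rows")
```

Note that with very few rows no value can reach an absolute Z-score of 3, so this check only becomes useful on reasonably sized data.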
  • Example of RFM segmentation:
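| ClusterLabel | Recency | Frequency | Monetary | RFM_Segment | RFM_Score |
|---|---|---|---|---|---|
| 0 | 2 | 1 | 1 | 211 | 4 |
| 1 | 5 | 5 | 5 | 555 | 15 |
| 2 | 3 | 3 | 3 | 333 | 9 |
| 3 | 4 | 2 | 2 | 422 | 8 |
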
| ClusterLabel | Customer Segment | Type of Customers | Characteristics | Recommended Marketing Strategy |
|---|---|---|---|---|
| 0 | At Risk | Infrequent, lower spending customers | These customers have the lowest frequency and monetary value and a low recency score. It has been a long time since their last purchase, they don't make purchases very often, and they tend to spend a smaller amount on each purchase. | Focus on re-engagement strategies to increase their purchase frequency and monetary value. |
| 1 | Champions | Frequent, high spending customers | These customers score the highest in all RFM categories. They've made a purchase very recently, make purchases frequently, and spend a lot on each purchase. | Focus on loyalty programs and exclusive services to maintain their satisfaction. A referral program might also suit this group. |
| 2 | Potential Loyal Customers | Less frequent but high spending customers | These customers have a high monetary value but lower frequency. Their recency is also lower, meaning it's been a while since their last purchase. | You can grow this segment into loyal customers through relevant recommendations and loyalty programs. |
| 3 | New Customers | Recent but less frequent and lower spending customers | These customers have a high recency but lower frequency and monetary value. These are customers that have recently made their first purchases. | These customers can be turned into repeat customers. Understand their needs and preferences and provide targeted marketing. |
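
Judging from the values in the score table, `RFM_Segment` looks like the three scores written side by side and `RFM_Score` like their sum. The sketch below reproduces those two columns under that assumption. How the 1 to 5 scores themselves were assigned is not stated here, so they are simply copied in as given.

```python
import pandas as pd

# Per-cluster R, F, M scores copied from the score table.
rfm_df = pd.DataFrame({
    'ClusterLabel': [0, 1, 2, 3],
    'Recency': [2, 5, 3, 4],
    'Frequency': [1, 5, 3, 2],
    'Monetary': [1, 5, 3, 2],
})

# RFM_Segment: the three digits concatenated as a string (e.g. '211').
rfm_df['RFM_Segment'] = (
    rfm_df['Recency'].astype(str)
    + rfm_df['Frequency'].astype(str)
    + rfm_df['Monetary'].astype(str)
)

# RFM_Score: the three digits added up (e.g. 2 + 1 + 1 = 4).
rfm_df['RFM_Score'] = rfm_df[['Recency', 'Frequency', 'Monetary']].sum(axis=1)

print(rfm_df)
```

The segment string keeps the individual scores visible, while the summed score gives a single number for ranking clusters.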