To train and predict

The dataset used is the breast cancer Wisconsin dataset from scikit-learn.

from sklearn.datasets import load_breast_cancer 
# Load the dataset
b_cancer = load_breast_cancer()

# Check the description
print(b_cancer.DESCR)

b_cancer is a Bunch (dictionary-like) object. Its .target attribute is a 1D array holding a binary variable that indicates whether each tumor is malignant (0) or benign (1).
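To see that encoding for ourselves, here is a small sketch that inspects the Bunch. The variable name `counts` is ours, added for illustration; everything else comes straight from the loaded dataset.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

b_cancer = load_breast_cancer()

# A Bunch supports both key access (b_cancer["data"]) and attribute access (b_cancer.data).
print(list(b_cancer.keys()))
print(b_cancer.data.shape)      # (n_samples, n_features)
print(b_cancer.target_names)    # position 0 is 'malignant', position 1 is 'benign'

# Count how many samples fall into each class (illustrative helper variable).
counts = np.bincount(b_cancer.target)
for name, count in zip(b_cancer.target_names, counts):
    print(f"{name}: {count}")
```

The order of target_names is what tells us that 0 means malignant and 1 means benign.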

X (the input features) will be the DataFrame with the target column, 'is_benign', dropped, and Y will be that column on its own. We build both from the same DataFrame so it's clear where they come from.

import pandas as pd  
# Create a DataFrame
df = pd.DataFrame(b_cancer.data, columns=b_cancer.feature_names)

# Add the target column to the DataFrame
df['is_benign'] = b_cancer.target

# Check the DataFrame
df.info()

# Create feature and target arrays
X = df.drop('is_benign', axis=1)
Y = df['is_benign']

X.columns, Y

Now we split our data into a training set and a test set, then scale the features before training the model. Later, we'll use the trained model to make predictions on the test features, X_test.

from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import StandardScaler 

# Train-test split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)

# Scale the input features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Notice that we fit the scaler on X_train only. X_test is then transformed with that same scaler, which holds the mean and standard deviation of the training set, so no information from the test set leaks into training.
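A quick way to convince ourselves of this is to check the scaled arrays. This is a minimal sketch that reuses the same split settings as above; the names `X_train_scaled` and `X_test_scaled` are ours, chosen so the raw arrays stay available for comparison.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, Y = load_breast_cancer(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from the training rows only
X_test_scaled = scaler.transform(X_test)        # reuses those training statistics

print(X_train.shape, X_test.shape)

# The training features end up with mean 0 and standard deviation 1 per column.
print(np.allclose(X_train_scaled.mean(axis=0), 0.0))
print(np.allclose(X_train_scaled.std(axis=0), 1.0))

# The test features are only approximately centered, because they were
# shifted by the training means rather than their own.
print(np.abs(X_test_scaled.mean(axis=0)).max())
```

The test set not being perfectly centered is expected: it is treated like unseen data.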


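During the learning phase, the model uses maximum likelihood estimation to find the parameters (the weights and bias) that make our observed labels (Y_train) most probable. Maximizing that likelihood is the same as minimizing the log loss over the training dataset. The log loss measures how far off our predicted probabilities are from Y_train, and it penalizes confident wrong predictions most heavily. Note that scikit-learn's LogisticRegression also applies L2 regularization by default, so what it actually minimizes is the log loss plus a penalty on the size of the weights.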
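To make the log loss concrete, here is a small sketch that computes it by hand and compares it with scikit-learn's log_loss. The helper `mean_log_loss` and the toy labels and probabilities are made up for illustration; they are not taken from the breast cancer data.

```python
import numpy as np
from sklearn.metrics import log_loss

def mean_log_loss(y_true, p):
    """Average binary log loss, where p is the predicted probability of class 1."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.asarray(p, dtype=float)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Toy example: four labels and two sets of predicted probabilities.
y_true = [1, 0, 1, 1]
mostly_right = [0.9, 0.1, 0.8, 0.95]
mostly_wrong = [0.1, 0.9, 0.2, 0.05]

loss_right = mean_log_loss(y_true, mostly_right)
loss_wrong = mean_log_loss(y_true, mostly_wrong)

print(f"Log loss (good probabilities): {loss_right:.4f}")
print(f"Log loss (bad probabilities):  {loss_wrong:.4f}")
print(f"scikit-learn agrees: {np.isclose(loss_right, log_loss(y_true, mostly_right))}")
```

The better the probabilities match the true labels, the smaller the loss, which is why minimizing it pulls the weights toward a good fit.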

In a nutshell, the model learns the relationship between the features and the target. Once trained, it applies the learned weights and bias to a new set of features (X_test), turns the result into a probability with the logistic function, and predicts the class label (Y_pred): benign (1) if that probability is above 0.5, otherwise malignant (0).

from sklearn.linear_model import LogisticRegression 

# Train the model
model = LogisticRegression()
model.fit(X_train, Y_train)

# Make predictions on the test data
Y_pred = model.predict(X_test)
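To see how weights, bias, the logistic function, and predicted labels fit together, here is a sketch that rebuilds the predictions by hand. It repeats the earlier steps so it can run on its own; the names `z_manual`, `p_manual`, `p_benign`, and `labels` are ours, and the comparison assumes the default LogisticRegression settings used above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, Y = load_breast_cancer(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train, Y_train)

# Linear score: weights times features, plus the bias.
z = model.decision_function(X_test)
z_manual = X_test @ model.coef_.ravel() + model.intercept_[0]

# The logistic function turns the score into a probability of class 1 (benign).
p_manual = 1.0 / (1.0 + np.exp(-z))
p_benign = model.predict_proba(X_test)[:, 1]

# predict() returns hard labels: 1 when the score is positive (probability above 0.5).
labels = model.predict(X_test)

print(np.allclose(z, z_manual))
print(np.allclose(p_manual, p_benign))
print(np.array_equal(labels, (z > 0).astype(int)))
```

So predict_proba gives the probabilities, while predict gives the 0/1 labels we evaluate next.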

Finally, we evaluate the performance of the model with Y_pred (predictions) and Y_test (the actual values). The evaluation metrics we'll use are the confusion matrix, accuracy, precision, recall, F1 score, and ROC AUC score.

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Evaluate the model
cm = confusion_matrix(Y_test, Y_pred)
acc = accuracy_score(Y_test, Y_pred)
prec = precision_score(Y_test, Y_pred)
rec = recall_score(Y_test, Y_pred)
f1 = f1_score(Y_test, Y_pred)
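# Note: Y_pred holds hard 0/1 labels, so this score equals the average of recall and specificity.
# Passing probabilities, e.g. model.predict_proba(X_test)[:, 1], would score the full ROC curve instead.
roc_auc = roc_auc_score(Y_test, Y_pred)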

# Check out the metrics
print(f"Confusion Matrix: \n{cm}")
print(f"Accuracy: {acc}")
print(f"Precision: {prec}")
print(f"Recall: {rec}")
print(f"F1 Score: {f1}")
print(f"ROC AUC Score: {roc_auc}")

Out:

Confusion Matrix: 
[[ 60   3]
 [  1 107]]
Accuracy: 0.9766081871345029
Precision: 0.9727272727272728
Recall: 0.9907407407407407
F1 Score: 0.981651376146789
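ROC AUC Score: 0.9715608465608465

  • The confusion matrix summarizes the prediction results of a classification problem. In scikit-learn, rows are the actual classes and columns are the predicted classes, so here 60 malignant and 107 benign tumors were classified correctly, 3 malignant tumors were predicted as benign, and 1 benign tumor was predicted as malignant.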

  • Accuracy measures the proportion of the total number of predictions that were correct.

  • Precision measures the proportion of positive predictions that were actually correct.

  • Recall (also known as sensitivity) measures the proportion of actual positives that were identified correctly.

  • F1 score is the harmonic mean of precision and recall and tries to balance both.

  • The ROC AUC score summarizes the performance of a binary classification model in a single number: the area under the ROC curve. The ROC curve plots the false positive rate (x-axis) against the true positive rate (y-axis) across candidate threshold values between 0.0 and 1.0.
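All of these numbers can be recovered from the confusion matrix alone. Here is a short sketch that redoes the arithmetic using the matrix printed above; the variable names (`tn`, `fp`, `fn`, `tp`, `specificity`, `auc_from_labels`) are ours. The last line relies on the fact that an ROC AUC computed from hard 0/1 labels reduces to the average of recall and specificity.

```python
import numpy as np

# Confusion matrix from the output above: rows are actual, columns are predicted.
cm = np.array([[60, 3],
               [1, 107]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)  # share of malignant tumors correctly flagged
f1 = 2 * precision * recall / (precision + recall)

# With hard labels there is a single threshold, so the area under the
# ROC curve is the mean of recall and specificity.
auc_from_labels = (recall + specificity) / 2

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
print(f"ROC AUC (from labels): {auc_from_labels}")
```

These match the scores reported by scikit-learn, which is a handy sanity check when reading a confusion matrix.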