Building an Email Spam Detection Model with Logistic Regression in Python using Scikit-learn

In this blog post, we will explore how to use logistic regression to classify emails as spam or not spam. We will use the scikit-learn library in Python to build and evaluate our model.

Step 1: Import the necessary libraries

We will start by importing the libraries used in the code:

  • pandas: to load and manipulate the dataset

  • seaborn and matplotlib: to visualize the data

  • scikit-learn: to build and evaluate the logistic regression model

Step 2: Load the dataset

We will load the email dataset into a pandas dataframe. The dataset contains the following columns:

  • Email No.: the unique identifier for each email

  • predictions: 0 if the email is not spam, and 1 if it is spam

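  • Other columns: features of the email. Because the code passes every remaining column to StandardScaler, these columns must be numeric (for example, word counts) rather than raw text such as the subject or message body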

Step 3: Exploratory Data Analysis

We will explore the dataset to get an idea of its structure and distribution. We will use the info() method to check the data types and missing values in the dataset. We will also use seaborn's histplot to visualize the distribution of the predictions variable.
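As a complement to info(), the short sketch below counts missing values per column and reports the share of each class. The toy dataframe and its values are hypothetical and only stand in for the real email data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing entry
toy = pd.DataFrame({
    "Email No.": ["Email 1", "Email 2", "Email 3", "Email 4"],
    "word_count": [12.0, np.nan, 7.0, 30.0],
    "predictions": [0, 0, 1, 0],
})

# Number of missing values in each column
missing = toy.isna().sum()
print(missing)

# Share of each class in the target column
class_share = toy["predictions"].value_counts(normalize=True)
print(class_share)
```

A strongly uneven class share at this stage is what motivates the balancing step that follows.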

Step 4: Data Preprocessing

We will preprocess the data by handling missing values and balancing the dataset. We will drop rows with missing values, then upsample the minority class (spam emails) with replacement until it matches the size of the majority class (non-spam emails), using the resample function from sklearn.utils.

We will also scale the data using scikit-learn's StandardScaler to ensure that all features have the same scale.
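To make the balancing step concrete, here is a minimal sketch that upsamples the minority class of a toy dataframe. The dataframe and the word_count column are made up for illustration; only the resample call mirrors the step described above.

```python
import pandas as pd
from sklearn.utils import resample

# Toy data (hypothetical): 6 non-spam rows and 2 spam rows
toy = pd.DataFrame({
    "word_count": [3, 5, 2, 8, 1, 4, 9, 7],
    "predictions": [0, 0, 0, 0, 0, 0, 1, 1],
})

majority = toy[toy["predictions"] == 0]
minority = toy[toy["predictions"] == 1]

# Draw minority rows with replacement until both classes have the same size
minority_upsampled = resample(
    minority,
    replace=True,
    n_samples=len(majority),
    random_state=42,
)

balanced = pd.concat([majority, minority_upsampled])
counts = balanced["predictions"].value_counts()
print(counts)
```

The upsampled rows are exact copies of existing spam rows, so no new information is added; the model simply sees the minority class more often.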

Step 5: Train-Test Split

We will split the data into training and testing sets using scikit-learn's train_test_split function. We will use 80% of the data for training and 20% for testing.
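The sketch below shows what an 80/20 split produces on synthetic data. The stratify argument is an optional addition that is not part of the original code; it is included here as an assumption, to keep the class ratio the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 3 features, 30% positive labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 30 + [0] * 70)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% held out for testing
    random_state=42,   # reproducible shuffle
    stratify=y,        # assumption: preserve the class ratio in both splits
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```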

Step 6: Model Building

We will build a logistic regression model using scikit-learn's LogisticRegression class. We will fit the model to the training data and use it to predict the test data.
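For readers curious about what the fitted model computes, the following sketch checks that, in the binary case, the predicted probability is the sigmoid of a linear score. It uses synthetic data from make_classification as a stand-in for the email features, so the numbers are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the email features
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression().fit(X, y)

# Linear score w.x + b for each sample
scores = model.decision_function(X)

# The probability of class 1 is the sigmoid of that score
proba_manual = 1.0 / (1.0 + np.exp(-scores))
proba_sklearn = model.predict_proba(X)[:, 1]
print(np.allclose(proba_manual, proba_sklearn))

# predict() assigns class 1 when that probability exceeds 0.5
labels_manual = (proba_manual > 0.5).astype(int)
print(np.array_equal(labels_manual, model.predict(X)))
```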

Step 7: Model Evaluation

We will evaluate the model's performance using accuracy, confusion matrix, and classification report. We will also plot the ROC curve and calculate the AUC score to evaluate the model's ability to distinguish between the two classes.
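To show how these metrics relate to each other, here is a small worked example on eight hand-made predictions. The labels and probabilities are invented for illustration; they are small enough to verify by hand.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Invented ground truth and predicted spam probabilities
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_proba = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.6, 0.7, 0.9])

# Turn probabilities into labels with the usual 0.5 threshold
y_pred = (y_proba >= 0.5).astype(int)

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

# AUC uses the probabilities, not the thresholded labels
auc = roc_auc_score(y_true, y_proba)

print(cm)
print(acc, auc)
```

Here one non-spam email is flagged as spam and one spam email is missed, so accuracy is 6/8. The AUC is the fraction of (spam, non-spam) pairs in which the spam email receives the higher probability, which is 11 of 16 pairs in this example.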

Complete code

#Step 1: Import the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

#Step 2: Load the dataset
data_set = pd.read_csv('emails.csv')

#Step 3: Exploratory Data Analysis
# Check the data types and missing values
data_set.info()

# Check the class distribution of the target variable
sns.histplot(data_set['predictions'], kde=False)
plt.show()

#Step 4: Data Preprocessing
# Handling missing values
data_set.dropna(inplace=True)
# Resampling to balance the data
df_majority = data_set[data_set['predictions']==0]
df_minority1 = data_set[data_set['predictions']==1]

n_samples = len(df_majority)
df_minority1_upsampled = resample(df_minority1,
                                  replace=True,         # sample with replacement
                                  n_samples=n_samples,  # to match majority class
                                  random_state=42)      # reproducible results

data_set = pd.concat([df_majority, df_minority1_upsampled])

target_dist = data_set['predictions'].value_counts()
print(f"Target Distribution:\n{target_dist}")

# Check the class distribution of the target variable after resampling
sns.histplot(data_set['predictions'], kde=False)
plt.show()

# Scaling the data
scaler = StandardScaler()
X = scaler.fit_transform(data_set.drop(columns=['predictions', 'Email No.']))
y = data_set['predictions']

#Step 5: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Step 6: Model Building
# Create the logistic regression model
lr_model = LogisticRegression()

# Fit the model to the training data
lr_model.fit(X_train, y_train)

# Predict the test data
y_pred = lr_model.predict(X_test)

#Step 7: Model Evaluation
# Calculate the accuracy score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Calculate the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print(f"Confusion Matrix:\n{conf_matrix}")

# Generate the classification report
class_report = classification_report(y_test, y_pred)
print(f"Classification Report:\n{class_report}")


# ROC curve and AUC
from sklearn.metrics import roc_curve, roc_auc_score

# Get the predicted probabilities for the test data
y_proba = lr_model.predict_proba(X_test)[:, 1]

# Calculate the FPR, TPR, and thresholds for different classification thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_proba)

# Plot the ROC curve
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

# Calculate the AUC
auc = roc_auc_score(y_test, y_proba)
print(f"AUC: {auc}")

Conclusion

In this blog post, we explored how to use logistic regression to classify emails as spam or not spam. We used the scikit-learn library in Python to build and evaluate our model. Logistic regression is a powerful and simple algorithm for classification tasks, and it can be used in a wide range of applications. By following these steps, you can build your own logistic regression model for email classification or other classification tasks.

The classification report, confusion matrix, and AUC score all report perfect performance on the test set, with an accuracy of 1.0 and an AUC of 1.0. Scores this high should be read with caution. The spam class was upsampled with replacement before the train-test split, so copies of the same spam email can end up in both the training set and the test set, and the scaler was fit on the full dataset as well. Both effects let information from the test set leak into training, which likely inflates the reported scores. With that caveat in mind, this model is a good starting point for building more complex models for email spam detection.
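One way to get a more trustworthy estimate is to split first, balance only the training portion, and let a pipeline fit the scaler on training data alone. The sketch below illustrates that ordering on synthetic data from make_classification; it is an alternative arrangement offered for comparison, not the code used for the results above, and max_iter=1000 is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Synthetic, imbalanced stand-in for the email data (about 20% "spam")
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.8, 0.2], random_state=0)

# 1. Split first, so the test set stays untouched
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# 2. Upsample the minority class inside the training set only
X_maj, X_min = X_train[y_train == 0], X_train[y_train == 1]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=42)
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))

# 3. The pipeline fits the scaler on the training data only
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_bal, y_bal)

# 4. Evaluate on the untouched, still imbalanced test set
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"AUC: {roc_auc_score(y_test, y_proba):.3f}")
```

With this ordering, no test email has a duplicate in the training set, and the test set keeps the class ratio the model would meet in practice.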