Predicting Heart Disease using Logistic Regression

Introduction

Heart disease is a serious and widespread health condition that affects millions of people worldwide. It arises from a variety of causes, including genetics, lifestyle, and environment. Machine learning models can help healthcare professionals predict the likelihood of a patient having heart disease, allowing for early intervention and prevention.

In this blog post, we will use logistic regression to predict whether a patient has heart disease based on risk factors such as age, sex, cholesterol level, and blood pressure. We will work with the Heart Disease dataset (Heart.csv), a widely used benchmark for this task.

Dataset

The Heart Disease dataset contains 303 rows and 14 columns (plus an unnamed index column, which we will drop during preprocessing). The columns are as follows:

  • Age: age in years

  • Sex: sex (1 = male, 0 = female)

  • ChestPain: chest pain type (1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic)

  • RestBP: resting blood pressure (in mm Hg on admission to the hospital)

  • Chol: serum cholesterol in mg/dl

  • Fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)

  • RestECG: resting electrocardiographic results (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy)

  • MaxHR: maximum heart rate achieved

  • ExAng: exercise induced angina (1 = yes; 0 = no)

  • Oldpeak: ST depression induced by exercise relative to rest

  • Slope: the slope of the peak exercise ST segment (1 = upsloping, 2 = flat, 3 = downsloping)

  • CA: number of major vessels (0-3) colored by fluoroscopy

  • Thal: 3 = normal, 6 = fixed defect, 7 = reversible defect

  • AHD: presence of heart disease (1 = yes, 0 = no)
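
Before any preprocessing, it is worth loading the file and taking a quick look. A minimal inspection sketch, assuming Heart.csv (the ISLR version of the Cleveland data) sits in the working directory:

import pandas as pd

df = pd.read_csv('Heart.csv')
print(df.shape)          # rows and columns, including the unnamed index column
print(df.head())         # first few rows
print(df.isna().sum())   # missing values per column; in this version of the data the few gaps are in 'Ca' and 'Thal'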

Data Preprocessing

Before building the logistic regression model, we need to preprocess the data. This involves encoding categorical features as numbers, dropping any rows with missing values, and splitting the data into training and test sets. Here's the code to do that:

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Load the data into a pandas dataframe
data_set = pd.read_csv('Heart.csv')

# Encode the categorical columns ('AHD', 'ChestPain', 'Thal') as integer codes
data_set['AHD'] = data_set['AHD'].astype('category').cat.codes
data_set['ChestPain'] = data_set['ChestPain'].astype('category').cat.codes
data_set['Thal'] = data_set['Thal'].astype('category').cat.codes

# Drop any rows with missing values
data_set.dropna(inplace=True)

# Split the data into features (X) and target (y), dropping the unnamed index column
X = data_set.drop(columns=['AHD', 'Unnamed: 0'])
y = data_set['AHD']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Here, we load the data into a pandas dataframe and encode the 'AHD', 'ChestPain', and 'Thal' columns as integer codes. We then drop any rows with missing values and split the data into training and test sets. Note that we do not scale the features here: scaling happens inside the pipeline in the next step, which guarantees the scaler is fit only on training data within each cross-validation fold.
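
To make the encoding step concrete, here is a minimal, self-contained illustration of what .cat.codes does (the example values are invented):

import pandas as pd

s = pd.Series(['typical', 'asymptomatic', 'nontypical', 'typical'])
print(s.astype('category').cat.codes.tolist())  # [2, 0, 1, 2] -- codes follow alphabetical category order

An alternative would be one-hot encoding with pd.get_dummies; integer codes keep the feature count small, but they do impose an arbitrary ordering on the categories.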

Next, we create a pipeline that couples a StandardScaler with logistic regression, and use grid search to find the best value of the regularization strength C.

# Create a pipeline that scales the features and fits a logistic regression
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))

# Define the hyperparameters to tune; the default lbfgs solver supports only the l2 penalty (or none)
hyperparameters = {'logisticregression__C': [0.01, 0.1, 1, 10, 100],
                   'logisticregression__penalty': ['l2']}

# Use grid search to find the best hyperparameters
gridsearch = GridSearchCV(pipe, hyperparameters, cv=5)
gridsearch.fit(X_train, y_train)
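
Once the search has run, its results are easy to inspect:

# Best hyperparameters found and the corresponding cross-validation accuracy
print(gridsearch.best_params_)
print(gridsearch.best_score_)

By default, GridSearchCV also refits the best pipeline on the whole training set, so gridsearch.best_estimator_ is ready to use for predictions.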

Putting it all together, here is the complete code:

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Load the data into a pandas dataframe
data_set = pd.read_csv('Heart.csv')

# Encode the categorical columns ('AHD', 'ChestPain', 'Thal') as integer codes
data_set['AHD'] = data_set['AHD'].astype('category').cat.codes
data_set['ChestPain'] = data_set['ChestPain'].astype('category').cat.codes
data_set['Thal'] = data_set['Thal'].astype('category').cat.codes

# Drop any rows with missing values
data_set.dropna(inplace=True)

# Split the data into features (X) and target (y), dropping the unnamed index column
X = data_set.drop(columns=['AHD', 'Unnamed: 0'])
y = data_set['AHD']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Create a pipeline that scales the features and fits a logistic regression
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))

# Define the hyperparameters to tune; the default lbfgs solver supports only the l2 penalty (or none)
hyperparameters = {'logisticregression__C': [0.01, 0.1, 1, 10, 100],
                   'logisticregression__penalty': ['l2']}

# Use grid search to find the best hyperparameters
gridsearch = GridSearchCV(pipe, hyperparameters, cv=5)
gridsearch.fit(X_train, y_train)

# GridSearchCV refits the best pipeline on the full training set (refit=True by default),
# so we can use it directly instead of training a new model
best_model = gridsearch.best_estimator_

# Make predictions on the training and test sets
y_pred_train = best_model.predict(X_train)
y_pred_test = best_model.predict(X_test)

# Calculate the accuracy, precision, recall, and f1-score for the training and test sets
accuracy_train = accuracy_score(y_train, y_pred_train)
precision_train = precision_score(y_train, y_pred_train)
recall_train = recall_score(y_train, y_pred_train)
f1_train = f1_score(y_train, y_pred_train)

accuracy_test = accuracy_score(y_test, y_pred_test)
precision_test = precision_score(y_test, y_pred_test)
recall_test = recall_score(y_test, y_pred_test)
f1_test = f1_score(y_test, y_pred_test)

# Print the evaluation metrics
print("Training accuracy:", accuracy_train)
print("Training precision:", precision_train)
print("Training recall:", recall_train)
print("Training f1-score:", f1_train)

print("Test accuracy:", accuracy_test)
print("Test precision:",precision_test)

Conclusion

Based on the results, we can conclude that the logistic regression model built to predict the presence of heart disease performs well on both the training and test sets. The training accuracy is 0.86, indicating that the model correctly predicts the presence or absence of heart disease for 86% of the training set. The test accuracy is 0.82, suggesting that the model performs well on previously unseen data.

The precision score for the training set is 0.88, indicating that when the model predicts the presence of heart disease, it is correct 88% of the time. The precision score for the test set is 0.91, suggesting that the model can generalize well to new data.

The recall score for the training set is 0.81, meaning the model correctly identifies 81% of the heart disease cases in the training data. In a medical context, high recall is especially desirable: it is better to flag every possible case of heart disease, even if some healthy individuals are incorrectly identified as at risk.
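
To look at precision and recall together, a confusion matrix and classification report for the test set can be printed with a small addition to the code above:

from sklearn.metrics import confusion_matrix, classification_report

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred_test))
print(classification_report(y_test, y_pred_test, target_names=['no disease', 'disease']))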

Overall, these results suggest that the logistic regression model can be a useful tool for predicting the presence of heart disease and may help identify individuals who are at risk of developing the disease. However, it is important to note that this is just one example of using machine learning in healthcare and that further research and testing are necessary before any clinical implementation.