Data Science For Good — Machine Learning for Heart Disease Prediction
Machine learning algorithms are used in a wide variety of fields, and healthcare is one of them. According to Wikipedia, machine learning is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence.
Machine learning algorithms build a mathematical model based on sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to perform the task.
I have been wanting to do a beginner-level machine learning project for some time now, and I was excited when I came across a machine learning competition hosted by DrivenData: Warm Up: Machine Learning with a Heart. It would be amazing if you gave it a try as well.
I chose this dataset because, according to drivendata.org, it is one of the smallest datasets on DrivenData, which makes it a great place to dive into the world of data science competitions.
DrivenData also notes that the project is a good fit if you’re looking to use data science for good.
About the Data
The dataset is from a study of heart disease that has been open to the public for many years. The study collects various measurements on patient health and cardiovascular statistics, and of course, makes patient identities anonymous.
Data is provided courtesy of the Cleveland Heart Disease Database via the UCI Machine Learning repository.
Aha, D., and Dennis Kibler. “Instance-based prediction of heart-disease presence with the Cleveland database.” University of California 3.1 (1988): 3–2.
PLEASE NOTE: The competition is for fun, and the data is available for use outside of DrivenData. If you use the dataset for projects, you are encouraged to share your work on the forum or on twitter! If you publish work, please cite the DrivenData platform paper.
In this post, I’ll walk through how I approached the project: loading the data, creating graphs, and building the simple models used to predict heart disease from the patient data (logistic regression, random forest and boosted logistic regression).
About Heart Disease
According to drivendata.org, preventing heart disease is important. Good data-driven systems for predicting heart disease can improve the entire research and prevention process, making sure that more people can live healthy lives.
In the United States, the Centers for Disease Control and Prevention is a good resource for information about heart disease. According to their website:
- About 610,000 people die of heart disease in the United States every year–that’s 1 in every 4 deaths.
- Heart disease is the leading cause of death for both men and women. More than half of the deaths due to heart disease in 2009 were in men.
- Coronary heart disease (CHD) is the most common type of heart disease, killing over 370,000 people annually.
- Every year about 735,000 Americans have a heart attack. Of these, 525,000 are a first heart attack and 210,000 happen in people who have already had a heart attack.
- Heart disease is the leading cause of death for people of most ethnicities in the United States, including African Americans, Hispanics, and whites. For American Indians or Alaska Natives and Asians or Pacific Islanders, heart disease is second only to cancer.
For more information, you can look at the website of the Centers for Disease Control and Prevention: preventing heart disease
Your goal is to predict the binary class heart_disease_present, which represents whether or not a patient has heart disease:
- 0 represents no heart disease present
- 1 represents heart disease present
There are 14 columns in the dataset, where the patient_id column is a unique and random identifier. The remaining 13 features are described in the section below.
- slope_of_peak_exercise_st_segment (type: int): the slope of the peak exercise ST segment, an electrocardiography readout indicating quality of blood flow to the heart
- thal (type: categorical): results of thallium stress test measuring blood flow to the heart, with possible values
- resting_blood_pressure (type: int): resting blood pressure
- chest_pain_type (type: int): chest pain type (4 values)
- num_major_vessels (type: int): number of major vessels (0-3) colored by fluoroscopy
- fasting_blood_sugar_gt_120_mg_per_dl (type: binary): fasting blood sugar > 120 mg/dl
- resting_ekg_results (type: int): resting electrocardiographic results (values 0,1,2)
- serum_cholesterol_mg_per_dl (type: int): serum cholesterol in mg/dl
- oldpeak_eq_st_depression (type: float): oldpeak = ST depression induced by exercise relative to rest, a measure of abnormality in electrocardiograms
- age (type: int): age in years
- max_heart_rate_achieved (type: int): maximum heart rate achieved (beats per minute)
- exercise_induced_angina (type: binary): exercise-induced chest pain (
And the coding begins!
If you missed the link to the data, you can find it on the DrivenData competition page.
1. The Packages Required
Note: If you don’t have a certain package installed, you can install it with install.packages("<package name>").
# Load the packages
library(tidyverse) # data cleaning
library(skimr) # descriptive statistics
library(caret) # machine learning models
library(ggvis) # interactive plots
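If you are not sure which of these packages are already installed, a small check can save some trial and error. The sketch below is one possible way to do it; the helper name missing_packages is my own and not part of any package.

```r
# Hypothetical helper: returns the packages from `pkgs` that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

required <- c("tidyverse", "skimr", "caret", "ggvis")
to_install <- missing_packages(required)

# Install only what is missing (uncomment to actually install).
# if (length(to_install) > 0) install.packages(to_install)
print(to_install)
```

An empty result means everything in the list is already available.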
2. Get the data into R
Notice that there are 2 train data sets; train values and train labels. The train values dataset contains all variables except the dependent variable, heart_disease_present while the train labels data set contains only the patient_id and heart_disease_present variable.
For effective analysis, we need to merge the 2 datasets by the unique variable, patient_id, and we will do that in R using the merge() function:
# The data
train_values <- read.csv("C:/Users/MARGRET/Downloads/train_values.csv")
train_labels <- read.csv("C:/Users/MARGRET/Downloads/train_labels.csv")
# Merge the 2 data sets
heart_data <- merge(train_values, train_labels, by = "patient_id")
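To see what merge() does here, the sketch below joins two tiny made-up data frames on patient_id. The ids and values are invented for illustration and are not taken from the competition files.

```r
# Made-up rows that mimic the shape of the two training files.
toy_values <- data.frame(
  patient_id = c("a1", "b2", "c3"),
  age = c(45, 60, 52)
)
toy_labels <- data.frame(
  patient_id = c("c3", "a1", "b2"),
  heart_disease_present = c(1, 0, 1)
)

# Rows are matched on patient_id, so the row order of the inputs does not matter.
toy_merged <- merge(toy_values, toy_labels, by = "patient_id")
print(toy_merged)

# A quick sanity check: no patient was lost or duplicated by the join.
stopifnot(nrow(toy_merged) == nrow(toy_values))
```

Running the same row-count check on the real heart_data is a cheap way to confirm the join behaved as expected.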
NOTE: our data is already split into train and test data. To avoid cleaning the data twice, let’s merge both data sets and then split them again when building our model.
The test data is missing one variable: the target variable, heart_disease_present.
To be able to merge the train and test data, we will create the heart_disease_present variable in the test data and assign it NA (missing) values, as this is what we will be predicting.
test_values <- read.csv("C:/Users/MARGRET/Downloads/test_values.csv")
# Add heart_disease_present to the test data set
test_values$heart_disease_present <- NA
# Merge the new train data and test data
heart_data <- rbind(heart_data, test_values)
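rbind() needs both data frames to have the same columns, which is why the NA column is added first. A minimal sketch with made-up rows (the real data has many more columns):

```r
# Made-up rows, for illustration only.
toy_train <- data.frame(patient_id = c("a1", "b2"), age = c(45, 60),
                        heart_disease_present = c(0, 1))
toy_test <- data.frame(patient_id = c("x9"), age = c(58))

# Without this line rbind() fails, because the column sets differ.
toy_test$heart_disease_present <- NA

toy_all <- rbind(toy_train, toy_test)
print(toy_all)
```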
3. Know your data
Let’s get an overview of our data
# Print the first 6 lines
head(heart_data)
# dimensions of the data
# 270 rows and 15 columns
dim(heart_data)
# data structure
str(heart_data)
# list types for each attribute
sapply(heart_data, class)
# data summary
summary(heart_data)
# column names
colnames(heart_data)
Any missing values?
# checking for missing values
sum(is.na(heart_data))
heart_data %>%
The data is pretty small: 270 rows and 15 columns. The train data has 180 rows and 15 columns, while the test data has 90 rows and 14 columns.
There are 90 missing values in our data. These are the NA values we assigned to heart_disease_present in the test data, so there is no need to worry about them.
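To confirm that all of the missing values sit in the target column, you can count them per column instead of in total. A minimal sketch on a made-up data frame, using only base R (colSums() and is.na()):

```r
# Made-up values, not the competition data.
toy <- data.frame(
  age = c(45, 60, 58),
  resting_blood_pressure = c(120, 140, 130),
  heart_disease_present = c(0, 1, NA)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Names of the columns that contain at least one missing value.
names(na_per_column)[na_per_column > 0]
```

On the real data, colSums(is.na(heart_data)) should show 90 for heart_disease_present and 0 everywhere else if the description above holds.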
Remember that our target variable(dependent variable) should be a categorical variable with binary (1 or 0) values.
Another great way to check the descriptive statistics of the data is the “skimr” package.
skimr::skim_to_wide() produces a nice data frame containing the descriptive stats of each of the columns. The output includes a small histogram for each numeric column, drawn without any plotting code. (In skimr 2.0 and later, skim_to_wide() is deprecated and skim() returns this data frame directly.)
skimmed_data <- skim_to_wide(heart_data)
Here is what the output looks like. Pretty fast and great, don’t you think?
A few things to note from the descriptive statistics: the age variable looks roughly normally distributed, which suggests the sample is not heavily skewed towards one age group.
4. Visualize the data
Say you want to get an idea of what your data looks like. In this case, graphs are the way to go. Scatter plots, for instance, can give you a great idea of what you’re dealing with: it can be interesting to see how much one variable is affected by another. But first, let’s get an overview of the response variable count:
ggplot(heart_data, aes(x = heart_disease_present, fill = factor(heart_disease_present))) +
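A complete version of such a bar chart could look like the sketch below. The geom, the labels and the made-up data frame are my assumptions, not the original code; with the real data you would pass the training rows of heart_data instead of toy.

```r
library(ggplot2)

# Made-up labels, only to make the example runnable on its own.
toy <- data.frame(heart_disease_present = c(0, 0, 0, 1, 1))

# One bar per class; the fill colour follows the class as well.
p <- ggplot(toy, aes(x = factor(heart_disease_present),
                     fill = factor(heart_disease_present))) +
  geom_bar() +
  labs(x = "heart_disease_present", y = "count", fill = "heart_disease_present")

print(p)
```

Wrapping the target in factor() on the x axis keeps ggplot2 from treating 0 and 1 as a continuous scale.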
Next, let’s check whether there is any correlation between pairs of variables.
Note that you first need to load the ggvis package with library(ggvis).
a) Checking the relationship between age and resting blood pressure, and whether heart disease is present or not (this is distinguished by the colours)
Code (because the heart_disease_present variable has 90 missing values that all come from the test data, I decided to leave the test data out and get a clearer viz of the train data):
# Data Viz
# scatter plot
heart_data %>%
ggvis(~age, ~resting_blood_pressure, fill = ~heart_disease_present) %>%
As you can see, the correlation is fairly low whether heart disease is present or absent. The data points seem to form a cluster per age range.
b) Checking the relationship between age and resting blood pressure by sex (this is distinguished by the colours), i.e. between males and females, who has higher blood pressure?
ggvis(~age, ~resting_blood_pressure, fill = ~sex) %>%
Note, female = 0 and male = 1. The correlation is weak for both males and females, and the data points are spread out across the graph.
c) Checking the relationship between age and heartbeats per minute by chest pain type (this is distinguished by the colours)
ggvis(~age, ~max_heart_rate_achieved, fill = ~ chest_pain_type) %>%
From the above plot, one can see that there seems to be a negative correlation between heartbeats per minute and age for each chest pain type. Chest pain type 4 (red colour) seems to be more prevalent than the other types between the ages of 40 and 70.
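Reading correlation off a scatter plot is subjective, so it can help to put a number on it. The sketch below computes the Pearson correlation between two columns, overall and within each group, using base R. The data frame is made up for illustration; with the real data you would use the training rows of heart_data.

```r
# Made-up values, not the competition data.
toy <- data.frame(
  age = c(40, 45, 50, 55, 60, 65, 70, 48),
  max_heart_rate_achieved = c(180, 172, 165, 160, 150, 148, 140, 170),
  sex = c(0, 1, 0, 1, 0, 1, 0, 1)
)

# Overall correlation between age and maximum heart rate.
overall <- cor(toy$age, toy$max_heart_rate_achieved)

# The same correlation computed separately for each value of sex.
by_sex <- sapply(split(toy, toy$sex), function(d) {
  cor(d$age, d$max_heart_rate_achieved)
})

print(overall)
print(by_sex)
```

Swapping sex for chest_pain_type or heart_disease_present gives the per-group numbers behind the other two plots.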
A quick look at the heart_disease_present attribute shows that the percentage split between absence and presence of heart disease is 55.6–44.4.
# Quick overview of the target variable - `heart_disease_present`
# Division of `heart_disease_present`
table(heart_data$heart_disease_present)
# Percentual division of `heart_disease_present`
# digits = 1 means rounded to 1 decimal place
round(prop.table(table(heart_data$heart_disease_present)) * 100, digits = 1)
5. What Next? Machine Learning Models to Use
The task was to evaluate machine learning models to predict the absence/presence of heart disease.
Training And Test Data
In order to assess the model’s performance later, we will need to divide the data set into two parts: a training set and a test set (as provided by the data source).
The reason is that when building models, the algorithm should only see the training data to learn the relationship between X and Y. This learned information forms what is called a machine learning model.
The model is then used to predict the Y in the test data by looking only at the X values of the test data. Finally, the predicted values of Y for the test data set are compared with the actual values to evaluate how good the model really is.
# Train and test data
train_data <- heart_data[1:180, ]
test_data <- heart_data[181:270, ]
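Splitting by row position works here because the 180 training rows come first. A slightly more defensive alternative, sketched below on made-up rows, is to split on whether the label is missing, so the result does not depend on row order.

```r
# Made-up combined data: labelled rows are training data, NA rows are test data.
toy_all <- data.frame(
  patient_id = c("a1", "x9", "b2", "y8"),
  age = c(45, 58, 60, 39),
  heart_disease_present = c(0, NA, 1, NA)
)

# TRUE for rows that came from the test set (no label yet).
is_test <- is.na(toy_all$heart_disease_present)
toy_train <- toy_all[!is_test, ]
toy_test <- toy_all[is_test, ]

print(nrow(toy_train))
print(nrow(toy_test))
```

This only works because the training labels themselves contain no missing values, which is the case described above.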
Building the Models
Using the “caret” package in R, let’s create some models using the train data and estimate their accuracy.
One of the biggest challenges beginners in machine learning face is deciding which algorithms to learn and focus on. The caret package, short for Classification And REgression Training, helps here: it integrates all activities related to model development in a streamlined workflow for nearly every major ML algorithm available in R.
Here is how to get the current list of supported models:
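A minimal sketch, assuming the caret package is installed: getModelInfo() returns a named list with one entry per supported model, so its names are the method strings you can pass to train().

```r
library(caret)

# Names of all models caret knows about; each can be used as `method` in train().
model_names <- names(getModelInfo())
length(model_names)
head(model_names)

# Check that the three methods used in this post are available.
c("glm", "rf", "LogitBoost") %in% model_names
```

Some methods depend on extra packages (for example, "rf" relies on randomForest), and caret offers to install them the first time you train such a model.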
Let’s evaluate 3 different algorithms:
- Logistic Regression
- Random Forest (RF)
- Boosted Logistic Regression
1. Logistic Regression
We will use the set.seed() function for reproducible results. The function takes an arbitrary integer argument, say 1, 123 or 223, and makes simulations and random objects reproducible.
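A quick way to convince yourself of this is to draw random numbers twice with the same seed. A minimal base R sketch:

```r
# Same seed, same "random" numbers.
set.seed(112)
first_draw <- sample(1:100, 5)

set.seed(112)
second_draw <- sample(1:100, 5)

# TRUE: both draws are identical because the seed was reset in between.
identical(first_draw, second_draw)
```

The same idea applies to train(): resetting the seed before each model means the resampling splits are drawn the same way every time.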
# 1. Logistic Regression
# Fit the logistic regression model, that is a GLM with a binomial family (logit link),
# using the caret function train().
set.seed(112)
mod_fit <- train(as.factor(heart_disease_present) ~ . - patient_id,
  data = train_data, method = "glm", family = "binomial"
)
mod_fit # Accuracy - 78.37%
2. Random Forest
# 2. Random Forest
set.seed(112)
mod_fit_rf <- train(as.factor(heart_disease_present) ~ . - patient_id,
  data = train_data, method = "rf"
)
mod_fit_rf # Accuracy - 81.01%
3. Boosted Logistic Regression
# 3. Boosted Logistic Regression
set.seed(112)
log_model <- train(as.factor(heart_disease_present) ~ . - patient_id,
  data = train_data, method = "LogitBoost"
)
log_model # Accuracy - 76.5%
Selecting the Best Model
We can create a plot of the model evaluation results and compare the spread and the mean accuracy of each model.
# Checking for the best model using a graph
results <- resamples(list(logistic = mod_fit, random_forest = mod_fit_rf, boosted_logistic = log_model))
We can see that the most accurate model is the random forest model.
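A resamples object can be summarised and plotted directly. The self-contained sketch below uses R’s built-in iris data (two classes only) and two stand-in models so that it runs without the heart data; the 5-fold cross-validation setting is my choice, not the original setup. With the objects from this post you would call summary(results) and bwplot(results).

```r
library(caret)

# Two-class subset of the built-in iris data, used only as a stand-in.
two_class <- droplevels(subset(iris, Species != "setosa"))

ctrl <- trainControl(method = "cv", number = 5)

# Use the same seed before each fit so both models see the same folds.
set.seed(112)
fit_glm <- train(Species ~ Sepal.Length + Sepal.Width, data = two_class,
                 method = "glm", family = "binomial", trControl = ctrl)
set.seed(112)
fit_tree <- train(Species ~ Sepal.Length + Sepal.Width, data = two_class,
                  method = "rpart", trControl = ctrl)

toy_results <- resamples(list(logistic = fit_glm, tree = fit_tree))

# Table of accuracy and kappa across the resamples.
print(summary(toy_results))

# Box-and-whisker plot comparing the spread of each metric per model.
print(bwplot(toy_results))
```

dotplot() works on the same object if you prefer point estimates with confidence intervals over box plots.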
It is also possible to rank the variables by importance. In this case, I will pass the most accurate model to the varImp() function.
# Variable Importance
feature_importance <- varImp(mod_fit_rf, scale = FALSE)
6. Finally, making the predictions
The random forest was the most accurate model, so we now use the predict() function to generate predictions for the test data.
# Using random forest as our best model, let's predict our results
predictions <- predict(mod_fit_rf, test_data)
predictions
Predicting the probabilities of absence or presence of heart disease based on the given variables (note that this call uses the logistic regression model, mod_fit):
# probabilities
predict(mod_fit, newdata = test_data, type = "prob")
Below is a snippet of the output
We can see that patient 181 has a 70% chance of absence of heart disease while patient 183 has a 94% predicted chance of having heart disease. Patient 194 has a 99% predicted chance of having heart disease.
One can delve deeper into the independent variables of each patient to figure out why this is the case. For instance, patient 194 was aged 63, had chest pain type 4 (the highest) and a resting blood pressure of 150 (greater than the 3rd quartile). These risk factors are probably why the patient had a 99% predicted probability of heart disease.
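If you need class labels rather than probabilities, you can apply a cut-off to the predicted probability of heart disease. The sketch below uses the three patients mentioned above; the 0.5 threshold and the column names are assumptions for illustration.

```r
# Predicted probability of heart disease for the three patients discussed above
# (row 181: 70% chance of absence, i.e. 30% chance of presence).
probs <- data.frame(
  row = c(181, 183, 194),
  prob_present = c(0.30, 0.94, 0.99)
)

# Assumed cut-off: label a patient as 1 when the probability exceeds 0.5.
threshold <- 0.5
probs$predicted_class <- ifelse(probs$prob_present > threshold, 1, 0)

print(probs)
```

A lower threshold flags more patients as at risk, which may be preferable when missing a case is more costly than a false alarm.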
In this post, my aim was to showcase my contribution to this machine learning project in R. Hopefully, you’ve made it this far. I am open to new ideas and suggestions for other models that could be used, as I understand there is more to be done.
Feel free to use the recipes in this post to evaluate your machine learning algorithms on your current or next machine learning project.
Peter Bull, Isaac Slavitt, Greg Lipstein. “Harnessing the Power of the Crowd to Increase Capacity for Data Science in the Social Sector” (Submitted on 24 Jun 2016). Cite as arXiv:1606.07781 [cs.HC]
Really great resources
- Machine learning in R step by step