Predicting the Presence of Heart Disease using Machine Learning

Machine Learning is used across many spheres around the world, and the healthcare industry is no exception. Machine Learning can play an essential role in predicting the presence or absence of locomotor disorders, heart disease, and more. Such information, if predicted well in advance, can provide important insights to doctors, who can then adapt their diagnosis and treatment on a per-patient basis.

Import libraries

First, I imported all the necessary libraries and Machine Learning algorithms.

# Basic
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams
from matplotlib.cm import rainbow
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

# Other libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Machine Learning
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

Import dataset

After downloading the dataset from Kaggle, I saved it to my working directory with the name dataset.csv. Next, I used read_csv() to read the dataset and save it to the dataset variable.
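A minimal sketch of that step (the file name matches what is described above):

dataset = pd.read_csv('dataset.csv')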

 

Before any analysis, I just wanted to take a look at the data. So, I used the info() method.

 

dataset.info()

 

As you can see from the output above, there are a total of 13 features and 1 target variable. Also, there are no missing values, so we don’t need to handle any null values. Next, I used the describe() method.
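For reference, this is again a single call on the DataFrame:

dataset.describe()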

 

The method revealed that the range of each variable is different. The maximum value of age is 77 but for chol it is 564. Thus, feature scaling must be performed on the dataset.

 

Understanding the data

Correlation Matrix

To begin with, let’s look at the correlation matrix of the features and try to analyse it. The figure size is set to 20 x 14 using rcParams. Then, I used pyplot to show the correlation matrix. Using xticks and yticks, I added the feature names to the axes, and colorbar() displays the colorbar for the matrix.

 

rcParams['figure.figsize'] = 20, 14
plt.matshow(dataset.corr())
plt.yticks(np.arange(dataset.shape[1]), dataset.columns)
plt.xticks(np.arange(dataset.shape[1]), dataset.columns)
plt.colorbar()

It’s easy to see that no single feature has a very high correlation with our target value. Also, some features are negatively correlated with the target and some positively.

 

Histogram  

The best part about this type of plot is that it just takes a single command to draw the plots and it provides so much information in return. Just use dataset.hist().
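A minimal sketch of that step (the figure-size line is an assumption, added only so the subplots stay readable):

rcParams['figure.figsize'] = 20, 14  # assumed size so all subplots remain legible
dataset.hist()
plt.show()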

 

Let’s take a look at the plots. They show how each feature and the label are distributed across different ranges, which further confirms the need for scaling. Wherever you see discrete bars, the feature is actually a categorical variable. We will need to handle these categorical variables before applying Machine Learning. Our target label has two classes: 0 for no disease and 1 for disease.

 

Bar Plot for Target Class

It’s really essential that the dataset we are working on is approximately balanced. An extremely imbalanced dataset can render the whole model training useless. Let’s understand this with an example.

 

Let’s say we have a dataset of 100 people: 99 non-patients and 1 patient. Without learning anything at all, a model that always predicts "non-patient" for any new person achieves 99% accuracy. However, since we are far more interested in identifying the 1 person who is a patient, we need a balanced dataset so that our model actually learns.
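A tiny sketch of that 99/1 scenario (purely illustrative, not part of the project code):

# hypothetical 100-person dataset: 99 non-patients (0) and 1 patient (1)
labels = np.array([0] * 99 + [1])
# a "model" that always predicts non-patient
predictions = np.zeros_like(labels)
accuracy = (predictions == labels).mean()  # 0.99, yet the single patient is never identified
print(accuracy)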

 

rcParams['figure.figsize'] = 8, 6
plt.bar(dataset['target'].unique(), dataset['target'].value_counts(), color = ['red', 'green'])
plt.xticks([0, 1])
plt.xlabel('Target Classes')
plt.ylabel('Count')
plt.title('Count of each Target Class')

For the x-axis, I used the unique() values from the target column and then set their names using xticks. For the y-axis, I used value_counts() to get the count for each class. I colored the bars red and green.

From the plot, we can see that the classes are almost balanced and we are good to proceed with data processing.

 

Data Processing

To work with categorical variables, we should break each categorical column into dummy columns with 1s and 0s.

 

Let’s say we have a column Gender, with the value 1 for Male and 0 for Female. It needs to be converted into two columns, each holding 1 where that category applies and 0 where it does not. Take a look at the example below.

 

# Original Column
# | Gender |
# |   1    |
# |   1    |
# |   0    |

# Dummy Columns
# | Gender_0 || Gender_1 |
# |    0     ||    1     |
# |    0     ||    1     |
# |    1     ||    0     |

To get this done, we use the get_dummies() method from pandas. Next, we need to scale the continuous columns, for which we use StandardScaler. Its fit_transform() method scales the data, and we write the scaled values back into those columns.

dataset = pd.get_dummies(dataset, columns = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal'])

standardScaler = StandardScaler()
columns_to_scale = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
dataset[columns_to_scale] = standardScaler.fit_transform(dataset[columns_to_scale])

The dataset is now ready. We can begin with training our models.

Machine Learning

In this project, I took 4 algorithms, varied their key parameters, and compared the resulting models. I split the dataset into 67% training data and 33% testing data.

 

y = dataset['target']
X = dataset.drop(['target'], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 0)

 

K Neighbors Classifier

This classifier looks at the classes of the K nearest neighbors of a given data point and assigns it the majority class. The number of neighbors can be varied; I varied it from 1 to 20 and calculated the test score in each case.

 

knn_scores = []
for k in range(1, 21):
    knn_classifier = KNeighborsClassifier(n_neighbors = k)
    knn_classifier.fit(X_train, y_train)
    knn_scores.append(knn_classifier.score(X_test, y_test))

Then, I plotted a line graph of the number of neighbors against the test score achieved in each case.

plt.plot([k for k in range(1, 21)], knn_scores, color = 'red')
for i in range(1, 21):
    plt.text(i, knn_scores[i-1], (i, knn_scores[i-1]))
plt.xticks([i for i in range(1, 21)])
plt.xlabel('Number of Neighbors (K)')
plt.ylabel('Scores')
plt.title('K Neighbors Classifier scores for different K values')

 

Support Vector Classifier

This classifier aims to find a hyperplane that separates the classes as well as possible, maximising the margin between the data points and the hyperplane. The hyperplane depends on the kernel used; I tried four kernels: linear, poly, rbf, and sigmoid.

 

svc_scores = []
kernels = ['linear', 'poly', 'rbf', 'sigmoid']
for i in range(len(kernels)):
    svc_classifier = SVC(kernel = kernels[i])
    svc_classifier.fit(X_train, y_train)
    svc_scores.append(svc_classifier.score(X_test, y_test))

Once I had the scores for each kernel, I used the rainbow colormap to pick a different color for each bar and plotted a bar graph of the scores.

colors = rainbow(np.linspace(0, 1, len(kernels)))
plt.bar(kernels, svc_scores, color = colors)
for i in range(len(kernels)):
    plt.text(i, svc_scores[i], svc_scores[i])
plt.xlabel('Kernels')
plt.ylabel('Scores')
plt.title('Support Vector Classifier scores for different kernels')

As can be seen from the plot above, the linear kernel performed the best for this dataset and achieved a score of 83%.

 

Decision Tree Classifier

This classifier builds a decision tree and uses it to assign a class to each data point. Here, we can vary the maximum number of features considered while building the tree. I varied it from 1 to 30 (the total number of features in the dataset after the dummy columns were added).

 

dt_scores = []
for i in range(1, len(X.columns) + 1):
    dt_classifier = DecisionTreeClassifier(max_features = i, random_state = 0)
    dt_classifier.fit(X_train, y_train)
    dt_scores.append(dt_classifier.score(X_test, y_test))

Once we have the scores, we can then plot a line graph and see the effect of the number of features on the model scores.

plt.plot([i for i in range(1, len(X.columns) + 1)], dt_scores, color = 'green')
for i in range(1, len(X.columns) + 1):
    plt.text(i, dt_scores[i-1], (i, dt_scores[i-1]))
plt.xticks([i for i in range(1, len(X.columns) + 1)])
plt.xlabel('Max features')
plt.ylabel('Scores')
plt.title('Decision Tree Classifier scores for different number of maximum features')

From the line graph above, we can clearly see that the maximum score of 79% is achieved when the maximum number of features is set to 2, 4, or 18.

 

Random Forest Classifier

This classifier takes the concept of decision trees to the next level: it creates a forest of trees, where each tree is built from a random selection of the features. Here, we can vary the number of trees used to predict the class. I calculated test scores for 10, 100, 200, 500, and 1000 trees.

 

rf_scores = []
estimators = [10, 100, 200, 500, 1000]
for i in estimators:
    rf_classifier = RandomForestClassifier(n_estimators = i, random_state = 0)
    rf_classifier.fit(X_train, y_train)
    rf_scores.append(rf_classifier.score(X_test, y_test))

Next, I plotted these scores as a bar graph to see which gave the best results. You may notice that I did not use the array [10, 100, 200, 500, 1000] directly as the x values: that would spread the bars across a continuous axis from 10 to 1000, which would be impossible to decipher. Instead, I placed the bars at the evenly spaced positions 0 through 4 and then relabeled them with the estimator counts using xticks.

 

colors = rainbow(np.linspace(0, 1, len(estimators)))
plt.bar([i for i in range(len(estimators))], rf_scores, color = colors, width = 0.8)
for i in range(len(estimators)):
    plt.text(i, rf_scores[i], rf_scores[i])
plt.xticks(ticks = [i for i in range(len(estimators))], labels = [str(estimator) for estimator in estimators])
plt.xlabel('Number of estimators')
plt.ylabel('Scores')
plt.title('Random Forest Classifier scores for different number of estimators')

 

Conclusion

The project involved analysing the heart disease patient dataset with proper data processing. Then, 4 models were trained and tested, with their maximum scores as follows:

K Neighbors Classifier: 87%

Support Vector Classifier: 83%

Decision Tree Classifier: 79%

Random Forest Classifier: 84%

The K Neighbors Classifier achieved the best score of 87% with 8 neighbors.

 


 
