Machine Learning

Capstone Project: Detection System for Non-Communicable Diseases (NCDs) Using Machine Learning

Image from Gebauer Company

Introduction and Problem

My love for working with data grew out of the data mining and machine learning courses I took during my undergraduate studies at ALU. At the same time, the amount of data in the world is growing exponentially, and using data science to build predictive models that solve real-world problems in the medical domain has become my main academic interest. In pursuit of that interest, my final capstone project focused on using machine learning to detect early risks of non-communicable diseases (NCDs). NCDs include heart disease, cancers, diabetes, and respiratory diseases. These diseases result from unhealthy habits such as tobacco use, harmful use of alcohol, unhealthy diets, and physical inactivity. Failure to detect NCDs at an early stage is a challenge. According to the World Health Organization, delayed detection allows the disease to develop further, and a late diagnosis results in a complicated case that is expensive to treat. Globally, NCDs are responsible for 70% of deaths.

Project Aim: The project’s main objective was to develop a web-based system that uses data and supervised machine learning techniques to automate the early detection of non-communicable diseases, so that prompt treatment can prevent or slow down the development of the disease.

System Design

Data Exploration: The heart disease and diabetes datasets used for this project are available online in the UC Irvine Machine Learning Repository. The diabetes dataset contains 768 data points collected through direct questionnaires from patients of Sylhet Diabetes Hospital in Sylhet, Bangladesh, and approved by a doctor. Below is a sample of the diabetes data.

The heart disease dataset contains 303 data points. It was created by the Hungarian Institute of Cardiology, Budapest; University Hospital, Zurich, Switzerland; University Hospital, Basel, Switzerland; and the V.A. Medical Center, Long Beach and Cleveland Clinic Foundation. Below is a screenshot of the heart disease data.

Data Preprocessing: Here, categorical data was converted to numerical values so that the machine learning algorithms could work with it. This applied to the diabetes dataset, which had categorical values. In addition, continuous variables such as age were standardized using scikit-learn’s StandardScaler. Below are screenshots of the preprocessed data.
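The two preprocessing steps described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the project's code: the column values are made up, `encode_categories` and `standardize` are hypothetical helper names, and the project itself relied on scikit-learn rather than hand-written functions. The rescaling shown (subtract the mean, divide by the population standard deviation) is the calculation a standard scaler performs.

```python
from statistics import fmean, pstdev


def encode_categories(values):
    """Map each distinct category to an integer code, in sorted order."""
    mapping = {label: code for code, label in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def standardize(values):
    """Rescale values to zero mean and unit variance (z-scores)."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical rows, not taken from the project's datasets.
gender = ["Male", "Female", "Female", "Male"]
age = [40, 58, 41, 45]

gender_codes, gender_mapping = encode_categories(gender)
age_scaled = standardize(age)
```

After this step every column is numeric, and the continuous columns are on a comparable scale, which generally helps gradient-based training.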

Feature Engineering: An XGBoost classifier was used to rank the features in each dataset by importance, and the eight most important features were selected for training the model.
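The selection step itself is simple once importance scores exist. The sketch below assumes the scores are already available as a dictionary (for example, read from a fitted XGBoost classifier's `feature_importances_` attribute); the feature names and numbers are placeholders I made up for illustration, and `top_k_features` is a hypothetical helper, not part of any library.

```python
def top_k_features(importances, k=8):
    """Return the names of the k features with the highest importance scores."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical scores for ten placeholder features; real scores would come
# from the fitted classifier, not from this dictionary.
importances = {
    "feature_0": 0.21, "feature_1": 0.02, "feature_2": 0.15, "feature_3": 0.09,
    "feature_4": 0.11, "feature_5": 0.01, "feature_6": 0.13, "feature_7": 0.08,
    "feature_8": 0.12, "feature_9": 0.08,
}

selected = top_k_features(importances)
```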

Data Splitting: The data was first separated into features and targets. Then, the preprocessed data was split into training and test sets: 80% of the data for training and 20% for testing.
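An 80/20 split can be sketched as follows. This is a minimal plain-Python version, assuming the rows are shuffled before splitting; the function name `split_train_test`, the seed, and the toy data are my own choices for illustration and are not taken from the project.

```python
import random


def split_train_test(features, targets, test_fraction=0.2, seed=42):
    """Shuffle row indices, then hold out `test_fraction` of the rows for testing."""
    if len(features) != len(targets):
        raise ValueError("features and targets must have the same length")
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_test = round(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [features[i] for i in train_idx]
    X_test = [features[i] for i in test_idx]
    y_train = [targets[i] for i in train_idx]
    y_test = [targets[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Toy data: ten rows with one feature each and a binary target.
X = [[i] for i in range(10)]
y = [i % 2 for i in range(10)]
X_train, X_test, y_train, y_test = split_train_test(X, y)
```

Keeping the test rows out of training is what makes the later evaluation numbers an estimate of performance on unseen patients.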

Model Development and Optimization: At this stage, the selected features were used to train a deep learning model built with Keras. Two models were developed, one for diabetes prediction and another for heart disease prediction. To regularize the neural networks, the dropout technique proposed by Srivastava et al. in their 2014 paper was used. The diabetes prediction model was optimized with the Stochastic Gradient Descent (SGD) algorithm. Backpropagation (backward propagation of errors) computes the gradient of the error with respect to the model’s weights in its current state, and SGD then uses that gradient to update the weights. The heart disease prediction model uses the Adam optimization algorithm (an extension of SGD) to update weights iteratively based on the training data. Below are screenshots of the models developed for diabetes and heart disease prediction.
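To make the two ideas in this step concrete, here is a small plain-Python sketch of one SGD update and of dropout. It is deliberately reduced to a single sigmoid unit with a cross-entropy loss, so it is not the architecture of either project model; the learning rate, dropout rate, and input values are assumptions chosen for illustration. In a multi-layer network, backpropagation would chain the same gradient calculation through every layer.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Predicted probability of the positive class for one input row."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def log_loss(p, y):
    """Binary cross-entropy for one prediction, clipped to avoid log(0)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def sgd_step(weights, bias, x, y, learning_rate=0.1):
    """One SGD update for a single sigmoid unit with cross-entropy loss."""
    error = predict(weights, bias, x) - y  # gradient of the loss w.r.t. the pre-activation
    new_weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias


def dropout(activations, rate, rng):
    """Inverted dropout: zero each activation with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


# One training example (made-up values) and one update.
weights, bias = [0.0, 0.0], 0.0
x, y = [1.0, 2.0], 1
loss_before = log_loss(predict(weights, bias, x), y)
weights, bias = sgd_step(weights, bias, x, y)
loss_after = log_loss(predict(weights, bias, x), y)

# Dropout is applied to hidden activations during training only.
dropped = dropout([1.0, 1.0, 1.0, 1.0], rate=0.5, rng=random.Random(0))
```

A single step moves the weights in the direction that lowers the loss on that example, and dropout randomly silences units so the network cannot lean too heavily on any one of them.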

Diabetes Model
Heart disease Model

Model Fitting: At this stage, the models created above were fitted to the training data, and some hyper-parameter tuning was applied. Hyper-parameters determine the network structure and how the neural network is trained. Below are screenshots.

Model Evaluation: Using Keras deep learning models, I built detectors for diabetes and heart disease. The diabetes model achieved a precision, recall, and area under the curve of 96%, 91%, and 8%, and the heart disease model achieved 92%, 84%, and 96%.
Precision tells us how often the model is right when it says an individual is at risk of disease.
Recall tells us what share of the individuals who are actually at risk of disease the model identifies.
Area Under the Curve (AUC) measures the ability of our model to distinguish between someone at risk of disease and someone who is not. The higher the value, the better the model separates the two groups.
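The three metrics can be computed by hand on a small example. The sketch below uses made-up labels and scores (not results from the project) and assumes a decision threshold of 0.5; the AUC is computed through its pairwise interpretation, the probability that a randomly chosen at-risk individual receives a higher score than a randomly chosen individual who is not at risk.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels where 1 means 'at risk'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def roc_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up ground truth and model scores for six individuals.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

precision, recall = precision_recall(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Here two of the three flagged individuals are truly at risk (precision 2/3), two of the three at-risk individuals are found (recall 2/3), and eight of the nine positive-negative pairs are ranked correctly (AUC 8/9).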

Model Performance Analysis: The performance of the diabetes and heart disease detection system depends on the amount of data available. We can see this in the accuracy of the two models. The model for diabetes detection has a higher accuracy of 94.23%, while that for heart disease detection has a lower accuracy of 88.52%. A likely reason is that the diabetes dataset has 768 data points, while the heart disease dataset has only 303. The graphs plotted above show how the diabetes model, trained on more data, performs well in accuracy, precision, F1 score, loss, recall, and area under the curve.

Model Accuracy: The confusion matrix helps us see how confused our model is when making predictions. It summarizes the number of correct and incorrect predictions as counts.

Summary of Confusion Matrix

True Positives give the number of individuals the model predicts to be at risk of the disease who are indeed at risk.

True Negatives give the number of individuals the model predicts not to be at risk of the disease who are indeed not at risk.

False Positives give the number of individuals the model predicts to be at risk of the disease who are in fact not at risk.

False Negatives give the number of individuals the model predicts not to be at risk of the disease who are in fact at risk.

In the medical domain, a high number of False Negatives (FN) is considered more dangerous than a high number of False Positives (FP): a person the model wrongly clears may go untreated, which can be fatal.
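The four cells of the confusion matrix can be counted directly from the true labels and the predictions. The sketch below assumes binary labels where 1 means "at risk" and uses made-up values rather than the project's results; `confusion_counts` is a hypothetical helper name.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, and FN for binary labels (1 = at risk, 0 = not at risk)."""
    counts = Counter()
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            counts["TP"] += 1
        elif t == 0 and p == 0:
            counts["TN"] += 1
        elif t == 0 and p == 1:
            counts["FP"] += 1
        else:  # t == 1 and p == 0, given binary labels
            counts["FN"] += 1
    return {key: counts[key] for key in ("TP", "TN", "FP", "FN")}


# Made-up labels and predictions for eight individuals.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_counts(y_true, y_pred)
accuracy = (cm["TP"] + cm["TN"]) / len(y_true)
```

In this toy example one at-risk individual is missed (FN = 1) and one healthy individual is flagged (FP = 1); in a screening setting the missed case is the costlier error.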

Below is the confusion matrix for our diabetes and heart disease detection models.

Then, the best models were deployed to a web application using Streamlit, a Python framework made for data scientists.

Project Impact: This project can directly impact our community by helping medical practitioners make better diagnoses through data-driven decisions. It can also affect individuals directly by ensuring that an NCD is discovered early enough for treatment that prevents or slows its development to be administered.

Challenges and Future Work
One limitation of this system is that it relies on a good internet connection, which could be a problem for users living in low-bandwidth areas. As future work, it would be interesting to investigate offline disease detection systems that do not require internet connectivity. One major challenge was obtaining data from a specific population, such as people in Rwanda or another African country, so we used data from an online repository. In addition, the datasets were small. A small training set results in a poor approximation, which can negatively affect the model’s performance.
