Abstract:
Diabetes is a condition of impaired blood glucose regulation that affects millions of
people, and machine learning algorithms are among the most accessible means of
predicting it. However, these algorithms often lack the ability to explain their
predictions and are therefore dubbed “black box” models. In addition, past research has
not studied how changes made during the data selection process affect the performance
of machine learning models in diabetes prediction. The study aims
to assess the effects of implementing SMOTE and feature selection techniques on the
predictive power of machine learning models. Moreover, the study aims to create a
web-based application that would function as a decision support tool to predict the
risk of patients developing diabetes. The web application also integrates explainable
artificial intelligence to highlight the characteristics that lead to the machine
learning model's prediction. Five supervised machine learning algorithms (Naïve Bayes,
Logistic Regression, K-Nearest Neighbors, Random Forests, and Support Vector Machines)
were evaluated on a blood glucose dataset. Results show that implementing SMOTE
increased the values
of the performance metrics (accuracy, AUROC, precision, recall, and F1 score),
and that the best-performing model was the Random Forest classifier trained with
SMOTE, which supports the use of machine learning algorithms in clinical practice
to improve healthcare. The resulting web application was implemented
with LIME, a Python library used for explainable artificial intelligence.
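
As an illustration of the oversampling step named above, the following is a minimal sketch of the core SMOTE idea: each synthetic minority-class sample is placed at a random point on the line segment between a real minority sample and one of its k nearest minority neighbours. It uses only the Python standard library and is a simplification for exposition, not the implementation used in the study, which the abstract does not specify. The function name `smote`, its parameters, and the toy [glucose, BMI] rows are assumptions made for this example.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority neighbours.

    Simplified sketch: numeric features only, Euclidean distance, brute-force
    neighbour search. `minority` is a list of equal-length feature lists.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: [glucose, BMI] rows for the minority (diabetic) class.
minority_rows = [[148.0, 33.6], [183.0, 23.3], [137.0, 43.1], [168.0, 38.0]]
new_rows = smote(minority_rows, n_new=6, k=2)
```

In practice the synthetic rows would be appended to the training split only, so that the test split still reflects the original class balance when metrics such as recall and AUROC are computed.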