dc.description.abstract |
Breast cancer remains the most frequently diagnosed cancer worldwide and a leading
cause of cancer death in women. A major reason the diagnosis of breast cancer from
Fine Needle Aspiration results still relies on manual review by doctors is the lack of
explainability in traditional black-box machine learning models. This paper aims to
provide a simple web user interface together with explainability through the LIME
Python package. The performance
of four machine learning models (K-Nearest Neighbors, Logistic Regression,
Random Forest, and Support Vector Machine) was compared in terms of the metrics
(accuracy, precision, F1-score, and area under the curve) produced when predicting
breast cancer diagnosis and of the models' applicability with the LIME Python package. The four
models were applied to the Breast Cancer Wisconsin Diagnostic Dataset under 10
different configurations: a) only scaling applied, b) scaling then random oversampling,
c) scaling, random oversampling, then feature extraction, d) scaling then feature extraction,
e) scaling, feature extraction, then random oversampling. Configurations f-j
mirror configurations a-e but omit scaling. The results show that
in terms of both metrics and applicability with LIME, Random Forest with
random oversampling produced the best results. This model and configuration were
therefore chosen for the web application. |
en_US |