Abstract:
The progression of HIV-1 among individuals receiving antiretroviral therapy (ART)
remains highly variable, driven by a complex interplay of viral genetic diversity
and patient-specific clinical factors. This study presents a machine learning framework
that integrates HIV-1 genomic features—specifically 8-mer nucleotide patterns
derived from the protease (PR) and reverse transcriptase (RT) regions—with
clinical laboratory markers such as CD4 count and viral load to predict disease
progression outcomes in ART-treated patients. Using a curated dataset,
we implemented a comprehensive preprocessing pipeline involving k-mer extraction,
CountVectorizer-based vectorization, SMOTE for addressing class imbalance,
standard scaling, and Principal Component Analysis (PCA) for dimensionality
reduction. Six machine learning models—Support Vector Machine (SVM),
Random Forest, Logistic Regression, XGBoost, K-Nearest Neighbors, and Multi-
Layer Perceptron—were systematically evaluated across 216 configurations. After
extensive hyperparameter tuning, the SVM model combined with SMOTE and
PCA consistently outperformed other models. Notably, it achieved an F1-score
of 0.96—selected as the primary evaluation metric due to the dataset’s original
imbalance—alongside high scores in accuracy (0.96), precision (0.98), recall (0.95),
and AUC-ROC (0.99). These results highlight SVM’s robustness and suitability
for high-dimensional genomic classification tasks. To enhance model transparency,
LIME (Local Interpretable Model-agnostic Explanations) was used to identify the k-mer features most influential for each prediction.
These patterns may correspond to biologically meaningful mutations linked
to ART resistance or viral fitness. The final, tuned SVM model was deployed
in thrHIVe, a web-based application designed to deliver real-time predictions and
explainable insights for clinicians and researchers. This study showcases the potential
of integrating genomic and clinical data with interpretable machine learning
to advance precision HIV care.
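
As a rough illustration of the k-mer extraction and count-vectorization steps named above, the following minimal sketch uses only the Python standard library as a stand-in for CountVectorizer. The function names (extract_kmers, count_matrix), the stride-1 overlapping window, and the toy sequences are assumptions made for illustration; they are not taken from the study's code or data.

```python
from collections import Counter


def extract_kmers(sequence, k=8):
    """Return all overlapping k-mers (stride 1, an assumption) of a nucleotide string."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_matrix(sequences, k=8):
    """Build a sorted k-mer vocabulary and a per-sequence count matrix."""
    counts = [Counter(extract_kmers(s, k)) for s in sequences]
    vocab = sorted(set().union(*counts))
    # One row per sequence, one column per k-mer in the vocabulary.
    matrix = [[c[kmer] for kmer in vocab] for c in counts]
    return vocab, matrix


# Hypothetical toy fragments, not real PR/RT sequences.
toy_sequences = ["CCTCAGATCACTCTT", "CCTCAAATCACTCTT"]
vocab, matrix = count_matrix(toy_sequences, k=8)
```

Each row of matrix is a count vector that could then be passed to the later stages described in the abstract (resampling, scaling, PCA, and a classifier); those stages are not reproduced here.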