Optimization of Random Forest for Health Data Classification Using PCA and K-Means SMOTE-ENN

Authors : Dadang Priyanto; Muhammad Innuddin; Hairani Hairani; Khairan Marzuki
Citations: 1; Year: 2025
source: Engineering Technology & Applied Science Research
Abstract

Health data classification is challenging because health data are typically high-dimensional and have imbalanced class distributions. These characteristics complicate the training of classification models and degrade their accuracy, so a method is needed that addresses both data complexity and class imbalance. This study aims to improve the performance of the Random Forest (RF) classification model on health data by integrating two approaches: Principal Component Analysis (PCA) and K-Means SMOTE-ENN. PCA reduces data dimensionality while retaining the most informative features, which minimizes noise and lowers computational demands. K-Means SMOTE-ENN balances the class distribution by combining clustering-based oversampling with Edited Nearest Neighbors (ENN) data cleaning, which mitigates the overfitting caused by unrepresentative synthetic data. RF was chosen as the classifier for its strong performance on high-dimensional data with complex variable interactions. Experimental results indicate that applying PCA and K-Means SMOTE-ENN together significantly enhances model performance. On the Pima Indians Diabetes dataset, accuracy rose to 98.41% and the Area Under the Curve (AUC) reached 98.33%. On the Heart Disease dataset, the model achieved an accuracy of 97.56% and an AUC of 97.73%. Compared with previous methods, the proposed approach improves accuracy by 2.91% over SMOTE and Stacking Ensemble on the Pima Indians Diabetes dataset, and improves accuracy by 6.26% and AUC by 14.73% over XGBoost on the Heart Disease dataset. These results show that combining PCA with K-Means SMOTE-ENN improves the performance of RF on imbalanced healthcare data.
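
The abstract describes the resampling stage only at a high level. As a rough illustration, and not the authors' implementation, the following standard-library Python sketch shows the two core ideas: SMOTE-style interpolation between minority samples, and an ENN-style cleaning pass that drops samples whose label disagrees with their nearest neighbors. The function names `smote_point` and `enn_clean`, the toy data, and the choice of k = 3 with a majority vote are assumptions made for this example. The K-Means cluster selection, the PCA projection, and the RF classifier are omitted.

```python
import math
import random
from collections import Counter


def smote_point(x, neighbor, rng):
    """Return one synthetic sample on the segment between x and a same-class neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]


def enn_clean(X, y, k=3):
    """Keep only samples whose label matches the majority label of their k nearest neighbors."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Rank all other samples by Euclidean distance to sample i.
        others = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )
        votes = Counter(y[j] for j in others[:k])
        if votes.most_common(1)[0][0] == yi:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (hypothetical): class 0 clusters near the origin, class 1 near (5, 5).
# The last class-1 point sits inside the class-0 cluster and acts as noise.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [0.5, 0.5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Oversample the minority class by interpolating between two genuine neighbors.
rng = random.Random(0)
synthetic = smote_point(X[4], X[6], rng)
X = X + [synthetic]
y = y + [1]

# Clean the augmented set: the noisy point is outvoted by its class-0 neighbors.
X_clean, y_clean = enn_clean(X, y, k=3)
print(len(X), "samples before cleaning,", len(X_clean), "after")
```

In this toy run the interpolated point lies between two genuine minority samples and survives the cleaning pass, while the minority-labeled point inside the majority cluster is removed. With an odd k and two classes the vote cannot tie; a multi-class setting would need an explicit tie-breaking rule.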


Concepts :
Artificial Intelligence in Healthcare
Imbalanced Data Classification Techniques
Smart Systems and Machine Learning
Citations by Year
Year  Count
2025  1