Abstract
Classifying health data is a significant challenge in healthcare because such data are typically high-dimensional and have imbalanced class distributions. Both properties complicate the training of classification models and degrade their performance, so a method is needed that handles data complexity and class imbalance while keeping the resulting information accurate and reliable. This study aims to improve the performance of the Random Forest (RF) classification model on health data by integrating two approaches: Principal Component Analysis (PCA) and K-Means SMOTE-ENN. PCA reduces the dimensionality of the data while extracting the most informative features, which minimizes noise and lowers computational demands. K-Means SMOTE-ENN balances the class distribution by combining clustering-based oversampling with Edited Nearest Neighbors (ENN) data cleaning, which mitigates the overfitting caused by unrepresentative synthetic data. RF was chosen as the classifier for its strong performance on high-dimensional data with complex variable interactions. Experimental results indicate that applying PCA and K-Means SMOTE-ENN together significantly enhances model performance. On the Pima Indians Diabetes dataset, accuracy reached 98.41% and the Area Under the Curve (AUC) reached 98.33%. On the Heart Disease dataset, accuracy reached 97.56% and AUC reached 97.73%. Compared with previous methods, the proposed approach improves accuracy by 2.91% over SMOTE and Stacking Ensemble on the Pima Indians Diabetes dataset, and improves accuracy by 6.26% and AUC by 14.73% over XGBoost on the Heart Disease dataset. These results show that combining PCA with K-Means SMOTE-ENN significantly improves the performance of RF on imbalanced healthcare data.
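The resampling idea described above, oversampling the minority class with synthetic points and then cleaning with Edited Nearest Neighbors, can be illustrated with a minimal pure-Python sketch. This is not the study's implementation: the K-Means clustering step that decides where to oversample is omitted, PCA and RF are not shown, and every function name and the toy data below are hypothetical. In practice one would more likely rely on library implementations such as those in imbalanced-learn and scikit-learn.

```python
import math
import random


def nearest_neighbors(point, candidates, k):
    """Return indices of the k candidates closest to point (Euclidean distance)."""
    order = sorted(range(len(candidates)),
                   key=lambda i: math.dist(point, candidates[i]))
    return order[:k]


def smote_oversample(minority, n_new, k=3, seed=0):
    """SMOTE-style oversampling: each synthetic point is a random interpolation
    between a minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = minority[:i] + minority[i + 1:]
        neighbor = others[rng.choice(nearest_neighbors(base, others, k))]
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


def enn_clean(X, y, k=3):
    """Edited Nearest Neighbours (majority-vote variant): keep a sample only if
    more than half of its k nearest neighbours share its label."""
    keep_X, keep_y = [], []
    for i, (x, label) in enumerate(zip(X, y)):
        others_X = X[:i] + X[i + 1:]
        others_y = y[:i] + y[i + 1:]
        agree = sum(1 for j in nearest_neighbors(x, others_X, k)
                    if others_y[j] == label)
        if agree * 2 > k:
            keep_X.append(x)
            keep_y.append(label)
    return keep_X, keep_y


# Hypothetical toy data: class 0 is the majority, class 1 the minority.
majority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2),
            (0.0, 0.4), (0.4, 0.0), (0.25, 0.35), (0.35, 0.3)]
noisy = (2.05, 2.1)  # labelled 0 but located inside the minority cluster
minority = [(2.0, 2.0), (2.2, 2.1), (2.1, 2.3), (1.9, 2.2)]

# Step 1: oversample the minority class until both classes have equal size.
n_majority = len(majority) + 1  # the noisy point also carries label 0
synthetic = smote_oversample(minority, n_new=n_majority - len(minority))

# Step 2: clean the combined data with ENN.
X = majority + [noisy] + minority + synthetic
y = [0] * n_majority + [1] * (len(minority) + len(synthetic))
X_clean, y_clean = enn_clean(X, y)

print("before cleaning:", y.count(0), "vs", y.count(1))
print("after cleaning: ", y_clean.count(0), "vs", y_clean.count(1))
```

In this toy setup the mislabelled point sits among minority samples, so the ENN step discards it while the two well-separated clusters are left intact. The cleaned data would then be passed to the classifier.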