Improving Heart Disease Prediction Using Random Forest and AdaBoost Algorithms

Heart disease is a major cause of death worldwide, so its diagnosis and prediction remain essential. Clinical decision support systems based on machine learning techniques have become a primary tool to assist clinicians and contribute to automated diagnosis. This paper aims to predict heart disease using the Random Forest algorithm enhanced with the AdaBoost boosting algorithm. The model is trained and tested on the University of California, Irvine (UCI) Cleveland and Statlog heart disease datasets using the 14 most relevant attributes. The results show that the Random Forest algorithm combined with AdaBoost achieved higher accuracy than Random Forest alone: 96.16% versus 95.98%. We compare the proposed model with reported machine learning classifiers, and the results support its efficiency and validity. Moreover, the proposed model achieved high accuracy compared with existing studies in the literature, which confirms that a clinical decision support system based on machine learning algorithms can be used to predict heart disease.


Introduction
Technological innovations are empowering, enriching, and significantly transforming working methods in healthcare. Artificial intelligence and machine learning (ML) can analyze data, improve diagnosis, make predictions, and support daily clinical practice [1]-[3]. Hence, healthcare industries are competing to build ML-based medical decision support systems for disease prediction. A predictive analytic model assists clinicians in making more accurate predictions based on the volume of information gathered from clinical data, such as data from past treatments and medical research results [4]. These models can be deployed prospectively in clinical settings to identify patients at risk of developing diseases. Various studies have suggested ML prediction models to diagnose diseases such as lung cancer, liver disease, breast cancer, obesity, Parkinson's disease, Alzheimer's disease, and cardiovascular diseases (CVDs) [5]-[12]. CVDs are the leading cause of death worldwide: according to the World Health Organization (WHO), more than 17.9 million people died from CVDs in 2016, representing 31% of all global deaths. In Morocco, CVDs represented 38% of all deaths in 2016 (Figure 1) [13]. This high mortality rate has attracted significant attention in recent years to improving and automating CVD diagnosis, resulting in numerous approaches.

Clinical relevance
Cardiovascular disease refers to all diseases that affect the heart or blood vessels; one of these types is heart disease (HD). HD is an umbrella term for various conditions that affect the heart's structure and function. All heart diseases are cardiovascular diseases, but not all cardiovascular diseases are heart diseases. The most familiar type of heart disease is coronary heart disease. It is often referred to simply as heart disease, although it is not the only kind.
Coronary heart disease mainly affects the coronary arteries and weakens the heart muscle when plaque (a combination of cholesterol, fat, and calcium) builds up in the artery walls. The plaque reduces the volume of oxygen-rich blood reaching the heart, which can cause chest pain and block blood flow; this is the most common cause of a heart attack [14], [15]. Various risk factors increase heart disease incidence. Some are uncontrollable, including sex, age, and a history of heart disease. Others can be controlled: high cholesterol, high blood pressure, obesity, physical inactivity, uncontrolled diabetes, stress, depression, anger, smoking, alcohol use, and poor diet [16]. Over time, these risk factors might cause changes in the heart and blood vessels that lead to heart attacks, heart failure, and stroke [17]. Thus, it is critical to consider risk factors early in life to prevent and predict HD, which remains challenging. In current clinical routine, angiography is the gold-standard technique used to diagnose heart disease. However, as a conventional invasive method, angiography has downsides such as complications and risks (including the dangers of radiation). It is also a costly and time-consuming assessment. To avoid conventional invasive methods for the diagnosis of heart disease, researchers have attempted to develop non-invasive automated systems based on predictive machine learning techniques. Numerous researchers have presented machine learning based approaches to predict heart disease. The study in [18] suggested a predictive model using C4.5 and fast decision tree algorithms applied to the four collected and separated UCI datasets; this model achieved an accuracy of 78.06% and 75.48% for C4.5 and fast decision tree, respectively, using only the Cleveland dataset.
The study in [19] predicted HD using the AdaBoost meta-algorithm on the Cleveland dataset and suggested reducing the number of attributes from 76 to 28, which gave a higher accuracy of 80.14%. A comparative study of four classifiers (SVM, KNN, C5.0, and a neural network) was conducted in [20]; it achieved a high accuracy of 93.02% with the C5.0 algorithm, but only on the Statlog dataset and validated with the train-test split method. The study in [21] reached 86.70% accuracy using a decision rules algorithm and cross-validation. Nevertheless, traditional ML algorithms present some limitations in terms of accuracy improvement. Thus, several other researchers have employed different approaches, including hybridization and ensemble learning, to improve HD prediction performance. The study in [22] combined the Infinite Latent Feature Selection method with an SVM classifier and achieved an accuracy of 89.93% using three datasets (Cleveland, Hungarian, and Switzerland) with 58 attributes. The study in [23] used the Z-Alizadeh Sani dataset to develop a hybrid method that enhances a neural network's performance using a genetic algorithm, yielding an accuracy of 93%. The study in [24] developed a clinical decision support system for the accurate prediction of HF, using a hybrid approach based on both ANN and Fuzzy AHP; it achieved an average prediction accuracy of 91.10%, which is 4.40% higher than that of the conventional ANN method. The study in [25] proposed optimized stacked SVM algorithms. The study in [26] implemented a homogeneous ensemble learning method that randomly partitions the dataset into smaller subsets using a mean-based splitting method and models each partition with a classification and regression tree (CART). A homogeneous ensemble was then created using an accuracy-based weighted aging classifier ensemble. The proposed method achieved an accuracy of 93% on the Cleveland dataset and 91% on the Framingham test set, yet the validation was carried out using only the train-test split method, which gives a less reliable estimate than cross-validation and may hide overfitting. The study in [27] trained and tested an optimized XGBoost approach only on the Cleveland dataset and achieved 91.80% accuracy using cross-validation. The study in [28] contributed a clinical decision support system to improve the prediction of HF. The proposed model was introduced with different combinations of features and several known classification techniques. It combined random forest with a linear model (HRFLM) to enhance performance, reaching an accuracy of 88.7%.
The contribution of the current work is a clinical decision support system that predicts the risk level of HD from patient data. A predictive model is proposed to detect patterns in existing HD patients' data: a classification approach (RFAB) that combines the Random Forest (RF) and AdaBoost algorithms. The effectiveness of the RFAB method is demonstrated by comparison with other studies and with other algorithms, including Naïve Bayes (NB), Decision Tree C4.5 (DT), Support Vector Machine (SVM), and RF, to identify which of the selected classification algorithms best classifies the HD data. The next section describes our methodology and briefly presents the datasets. Section 4 presents the experimental results. The outcomes are discussed in Section 5. Finally, conclusions are given in Section 6.

Proposed model
The studied model is based on datasets available online. The proposed architecture of our system for predicting the presence of HD is shown in Figure 2. Once preprocessing is completed in the initial step, the Random Forest algorithm is used as a classifier to predict whether a patient has HD, followed by the AdaBoost boosting algorithm to improve the efficiency of the RF classifier. To evaluate the performance and validate the proposed approach, we use the confusion matrix, from which the accuracy is computed. Finally, we compare our results with other studies in the literature.

Data collection
The heart disease datasets used for the experiments are the Cleveland and Statlog datasets [29], [30]. Both datasets were retrieved from the machine learning repository of the University of California, Irvine (UCI). The Cleveland data were collected at the V.A. Medical Center, Long Beach, and the Cleveland Clinic Foundation. The principal investigator responsible for data collection was Robert Detrano.
The Cleveland dataset comprises 303 instances and 76 raw attributes. We used 14 established features, including the predicted attribute, which is 0 for the absence of HD and 1 to 4 for ascending risk levels of HD. Patients' sensitive information, such as names and social security numbers, was removed for confidentiality. The Statlog heart disease dataset also consists of 14 features, with 270 instances in total. The distribution and percentage of absence and presence of HD in both datasets are shown in Table 1. Table 2 presents the 14 attributes used for modeling, with their descriptions and values. The histograms of all features are shown in Figure 3.

Data preprocessing
Data preprocessing involves transforming data into an accurate and understandable format to improve model efficiency and accuracy. Medical data are often noisy and incomplete, with missing attribute values and irrelevant entries [31]. In our research, the Statlog dataset has no missing values. The Cleveland dataset, however, has six missing values: four for the number of major vessels (Ca) attribute and two for the Thal attribute. Although both attributes have less than 5% missing values, we chose to impute the missing values rather than remove the affected records, in order to keep as much data as possible. Because the missing values are categorical, we use mode imputation, which replaces each missing value with the most frequently occurring value of the attribute [32].
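As an illustration of the mode imputation described above, the following minimal sketch uses only the Python standard library. The "?" missing-value marker, the helper name impute_mode, and the sample column are assumptions made for the example, not details taken from the study.

```python
from collections import Counter

MISSING = "?"  # assumed missing-value marker; adjust to match the actual file


def impute_mode(values, missing=MISSING):
    """Replace missing entries with the most frequent observed value."""
    observed = [v for v in values if v != missing]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mode_value = Counter(observed).most_common(1)[0][0]
    return [mode_value if v == missing else v for v in values]


# Hypothetical excerpt of the categorical "ca" column (not real patient data).
ca_column = [0, 0, "?", 1, 0, 2, "?", 3]
ca_imputed = impute_mode(ca_column)
print(ca_imputed)  # [0, 0, 0, 1, 0, 2, 0, 3]
```

Mode imputation keeps every record at the cost of slightly inflating the frequency of the most common category, which is a small effect when only a handful of values are missing.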
The original datasets contain five output values: 0 indicates the absence of HD, and values from 1 to 4 indicate increasing levels of HD. In this study, we are interested only in the presence or absence of HD. Thus, the output attribute is recoded as a binary value, where 0 indicates the absence and 1 the presence of HD in the patient.
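The recoding of the output attribute can be sketched as follows. The function name binarize_target and the sample values are hypothetical and serve only to show the mapping.

```python
def binarize_target(levels):
    """Map 0 -> 0 (absence of HD) and 1..4 -> 1 (presence of HD)."""
    binary = []
    for level in levels:
        if level not in (0, 1, 2, 3, 4):
            raise ValueError(f"unexpected target value: {level!r}")
        binary.append(0 if level == 0 else 1)
    return binary


# Hypothetical target column with the original five levels.
original_target = [0, 2, 1, 0, 4, 3]
binary_target = binarize_target(original_target)
print(binary_target)  # [0, 1, 1, 0, 1, 1]
```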

Classification method
Predictive analytics and machine learning go hand in hand, as predictive models generally embed a machine learning algorithm that enables them to learn from observed data in a training dataset and make the intended predictions [33]. There are two subtypes of predictive models: regression and classification. Regression models analyze the relationships between variables to make predictions about continuous variables, while classification assigns discrete class labels to particular observations as the outcome of a prediction [33]. Machine learning tasks can be organized into two main sub-categories: supervised and unsupervised learning [4]. Our study has a specific target to predict for new cases, namely whether heart disease is present or not. Hence, a supervised classification approach is appropriate, and we train our data using the Random Forest and AdaBoost algorithms.
Random Forest is an ensemble of tree predictors that constructs multiple decision trees at training time and produces the class by voting over the individual trees. It is similar in concept to decision tree algorithms, yet it builds a forest of decision trees with split attributes chosen at random. Its main advantage is improving prediction accuracy without a significant increase in computational cost [34].
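The two ingredients mentioned above, random attribute selection and voting over trees, can be sketched in isolation. This is not a full Random Forest: the trees themselves are omitted, and the subset size of 4, the seed, and the vote table are hypothetical values chosen for the example (13 is the number of input attributes once the target is excluded from the 14).

```python
import random
from collections import Counter


def random_feature_subset(n_features, subset_size, rng):
    """Pick the attributes that one tree (or split) is allowed to consider."""
    return sorted(rng.sample(range(n_features), subset_size))


def majority_vote(per_tree_predictions):
    """Combine per-tree class predictions into one label per sample."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*per_tree_predictions)]


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
subset = random_feature_subset(n_features=13, subset_size=4, rng=rng)

# Hypothetical predictions of three trees for three patients (1 = HD present).
tree_votes = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
]
forest_prediction = majority_vote(tree_votes)
print(forest_prediction)  # [1, 0, 1]
```

Using an odd number of trees avoids ties in a binary problem; with an even number, a tie-breaking rule would have to be chosen.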
Boosting is a common approach for learning classifiers that uses an optimally weighted majority vote of weak classifiers to produce a strong classifier [35]. The best-known boosting algorithm is AdaBoost, introduced by Yoav Freund and Robert Schapire [36] as a greedy meta-algorithm that builds up a powerful classifier by optimizing the weights and adding one weak classifier at a time.
The AdaBoost algorithm maintains a set of weights over the training set. The weight on training example i in round k is denoted S_k(i). Initially, all weights are set equally, but in each round the weights of misclassified samples are increased so that the weak learner is forced to focus on the hard examples in the training set [37].
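A single reweighting round can be sketched as follows, assuming the common binary formulation in which the vote weight of the weak learner is alpha = 0.5 * ln((1 - error) / error). The function name adaboost_round and the toy labels are hypothetical; the weights passed in are assumed to sum to 1.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """Perform one AdaBoost reweighting step.

    Returns the weak learner's vote weight (alpha) and the normalized
    sample weights for the next round. Input weights must sum to 1.
    """
    # Weighted error: total weight of the misclassified samples.
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
    alpha = 0.5 * math.log((1 - error) / error)
    # Increase the weights of misclassified samples, decrease the others.
    updated = [w * math.exp(alpha if t != p else -alpha)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


y_true = [1, 0, 1, 0]
y_pred = [1, 0, 1, 1]  # the last sample is misclassified
initial = [1 / len(y_true)] * len(y_true)  # equal initial weights
alpha, next_weights = adaboost_round(initial, y_true, y_pred)
print(round(alpha, 4), [round(w, 4) for w in next_weights])
# 0.5493 [0.1667, 0.1667, 0.1667, 0.5]
```

After the update, the single misclassified sample carries half of the total weight, which is what pushes the next weak learner toward the hard examples.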
We propose a combination of Adaptive Boosting (AdaBoost) with RF as the base learner. As mentioned above, AdaBoost is a meta-algorithm that can be used in conjunction with many other learning algorithms to improve their performance and flexibility. It works by repeatedly running a given weak or base learning algorithm on differently weighted versions of the training data and then combining the resulting classifiers into a single composite classifier [38]. We refer to the proposed learning method as RFAB. The steps of the hybrid AdaBoost and Random Forest algorithm are given in Figure 4.

The performance measurement
To measure the performance of the proposed classification algorithms, several evaluation measures are used, including sensitivity, specificity, accuracy, recall (the proportion of instances of a given class that are correctly classified), MCC (a measure of the quality of binary classifications), and the ROC curve. All these measures are calculated from the confusion matrix described in Table 3. In the confusion matrix, True Negative (TN) means that the model correctly classifies a healthy person. True Positive (TP) means that a person with heart disease is correctly classified by the model. False Positive (FP) means that the model misclassifies a healthy person as having heart disease. False Negative (FN) means that a patient with heart disease is incorrectly classified as healthy. Specificity measures the ratio of negatives that are correctly classified, sensitivity measures the ratio of real positives that are correctly identified, and accuracy shows the overall performance of the classification model [39]. These measures are given by the following standard formulas:

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
$$\text{Specificity} = \frac{TN}{TN + FP}$$
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
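As an illustration, the confusion-matrix measures can be computed with the short sketch below, which uses the standard definitions of sensitivity, specificity, accuracy, and MCC. The counts in the example are hypothetical and are not results from this study; the function assumes each class has at least one sample.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Compute standard measures from binary confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # also called recall of the positive class
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts for 100 patients (not taken from the paper's tables).
metrics = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
print({name: round(value, 4) for name, value in metrics.items()})
# {'sensitivity': 0.8, 'specificity': 0.9, 'accuracy': 0.85, 'mcc': 0.7035}
```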

Results
The model is validated using the K-Fold cross-validation (CV) method, a widely used way to detect overfitting and obtain a more reliable estimate of performance on unseen data. In K-Fold CV, the whole dataset is split into k equal parts. At each iteration, (k-1) parts are used for training and the remaining part is used for testing. This process is repeated for k iterations so that each part serves once as the test set; in this study, k = 10 is used for the experimental work.
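The fold construction can be sketched with the standard library alone. The sample count of 573 assumes that the combined dataset is simply the 303 Cleveland and 270 Statlog instances with no rows dropped; the shuffling seed and the function name k_fold_splits are likewise assumptions, and the sketch does not stratify by class.

```python
import random


def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once before splitting
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # spread the remainder
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# 573 = 303 (Cleveland) + 270 (Statlog); assumes no rows were dropped.
folds = list(k_fold_splits(n_samples=573, k=10))
fold_sizes = [len(test) for _, test in folds]
print(fold_sizes)  # [58, 58, 58, 57, 57, 57, 57, 57, 57, 57]
```

A model would be trained on each train split and scored on the matching test split, with the reported accuracy taken as the mean over the ten folds.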
The results are evaluated using confusion matrix measurements. Table 4 shows the results achieved on the combined dataset with both the RF algorithm and the hybrid RFAB model, which combines RF with the AdaBoost algorithm; they are illustrated in Figure 5. RF alone achieves 95.98% accuracy, and the hybrid RFAB method improves this by 0.18 percentage points to 96.16%. The AUC of the ROC chart in Figure 6 further supports the proposed model's efficiency.
The performance of the proposed method is compared with well-known ML algorithms, including Naïve Bayes (NB) [40], C4.5 decision tree (DT) [41], [42], Support Vector Machine (SVM) [43], and Random Forest (RF) [44], using both the separate and the combined datasets; the results are listed in Table 5. NB achieved a high accuracy of 82.17% on the Cleveland dataset, while on the Statlog dataset both NB and DT achieved an accuracy of 79.25%. All the algorithms listed, including the proposed method, achieved better results on the combined dataset than on each individual dataset. Figure 7 shows the accuracy comparison: the proposed RFAB achieves the best performance with 96.16% accuracy, compared with other classifiers such as NB (87.60%), DT (77.13%), and SVM (94.58%).

Discussion
Several studies on diagnosing and predicting heart disease using ML techniques are listed in Table 6. Most studies have used four datasets, namely Cleveland [29], Hungarian, Long Beach, and Switzerland [45], all collected from the UCI machine learning repository. Further studies have used only the Cleveland dataset since it has very few missing values [11], [13], [16]-[19]. In contrast, the other datasets have more than 90% missing values for some attributes, which might compromise the accuracy and quality of the results; examples are the "thal" and "ca" attributes, which have been shown to correlate highly with the output attribute. The number of selected attributes ranged from 76 to 8, including the class attribute. Generally, the studies that used many attributes applied feature selection to improve relevance [19], [22]. Hence, we use only 14 attributes (age, gender, chest pain, blood pressure, etc.) that are relevant to the risk factors of HD diagnosis. Various prediction models have been built using ML techniques. Despite a substantial research output, no gold-standard model is yet available to predict HD, so there is still a need for improvement. Our model applies a combination of the Random Forest and AdaBoost algorithms, trained and tested on both the Cleveland and Statlog datasets with 14 attributes, and obtained a high accuracy of 96.16% using cross-validation. The summary of the comparative results obtained with various classifiers is shown in Figure 10. Our proposed method reached the highest accuracy compared with NB, DT, SVM, and RF, which demonstrates the enhanced efficiency of the proposed hybrid method. Many parameters affect the construction of an HD prediction model, including the choice of dataset, the number of attributes and the output class, the algorithm used, and the validation method. Therefore, a comparison of our proposed method with other studies should take all of these parameters into account to be meaningful. Nevertheless, the proposed method compares favorably with other studies in terms of accuracy and validation technique.

Conclusion
In conclusion, we developed a clinical decision support system based on machine learning, using clinical datasets to produce an accurate and efficient diagnosis system that can assist clinicians. The results support our model's validity, with a high accuracy of 96.16%, and we demonstrated the effectiveness of the proposed method using 10-fold cross-validation. Random Forest combined with AdaBoost gave very promising results in predicting heart disease. As future work, we plan to train our model on larger datasets. Finally, a comparison with other ensemble learning methods would be of interest.