Machine Learning Architecture for Heart Disease Detection: A Case Study in Iraq

In recent years, the amount of data has increased dramatically, driven by many real-world fields such as marketing, learning, social media, multimedia, and medicine. As a result, data mining algorithms have been used extensively on these data as one of the newest data modeling and analytical tools, through which a knowledge-rich environment can be generated and decision-making can be improved. In medicine, data mining tools can be employed to reduce the number of diagnostic tests and to predict future trends through information-driven decisions. There are two categories of data mining algorithms: descriptive and predictive. The descriptive type covers clustering, association, summarization, and sequence discovery, whereas the predictive type comprises classification, regression, and time series analysis. This paper presents a study intended to help specialists and physicians in Iraq investigate heart problems using the Weka 3.8.3 software, focusing on four data mining classification techniques (IBk, J48, Naïve Bayes, and REPTree). The predictive precision tests, the ROC curve, and the AUC value are calculated on a dataset compiled from the hospital of Baghdad Medical City and the hospital of Ibn al-Bitar. The J48 technique gave the best performance, with an accuracy of 94.5%.


Introduction
Heart disease remains the principal cause of death worldwide, but earlier detection helps to prevent heart attacks and strokes. In practice, some symptoms are not taken into account, which can later cost patients their lives. Hence, physicians need to estimate the onset of the disease in their patients before it occurs. The healthcare industry produces voluminous clinical data which, unfortunately, are not well exploited to extract hidden valuable knowledge for effective decision-making [1]. Many factors may contribute to heart disease, for example high blood pressure, a bad cholesterol level in the blood, unhealthy food, lack of physical activity, smoking, and noxious drugs; these causes serve as risk factors for predicting heart disease [2]. There are two main elements for controlling heart disease: a healthy lifestyle and timely diagnosis. Regular check-ups (such as chest X-rays, angiography, and echocardiography) play a very important role in the diagnosis and early prevention of heart disease. However, such tests consume both time and money. Adopting data mining techniques helps to reduce the number of tests and to diagnose heart disease early. Different data mining techniques can be applied to discover relationships and hidden trends in the data. Furthermore, these data are predominantly used for decision support, forecasting, and assessment.
A data mining system starts by extracting data, then transforming and loading them into a data warehouse system. Afterwards, the data are stored and managed in a multidimensional database system, and information technology professionals and business analysts access the stored data for analysis with application software. Finally, the result is presented in a specific format such as a graph or table. Data mining efforts can be classified as either descriptive or predictive. A descriptive model characterizes the general properties of the data in the database (the data speak for themselves), while a predictive model performs inference on the stored data to make future predictions [3]. There are many data mining techniques in predictive and descriptive models, such as classification, association rules, clustering, regression, and summarization [4]. Both prediction and classification are types of data processing that can be used to generate concepts that define relevant data classes or to predict possible data patterns [3]. In this paper, we aim to identify the data mining algorithms best suited to forecasting cardiac disease, that is, algorithms that are both computationally efficient and accurate. Weka software was used to apply the IBk, J48, Naïve Bayes, and REPTree techniques. A standard data set was obtained from two hospitals in Iraq, Ibn Al-Bitar and Baghdad Medical City. The rest of the paper is organized as follows. The paper starts by addressing the literature review and related works. Afterwards, the methodology and experimental approach are described in detail. Following that, the experimental results and comparisons are presented. Finally, the conclusions are given in the last section.

Literature review and related works
Researchers have made many attempts to use data mining algorithms to assist healthcare providers in diagnosing cardiac disease. These techniques include machine learning [5], Markov chain models, genetic algorithms, reinforcement learning, decision trees, and other statistical methods.
Mai et al. [6] identified the gaps in research between cardiovascular diagnosis and treatment. They also proposed a model to reduce those gaps systematically by applying single and hybrid data mining techniques to recommend appropriate heart disease medication, and then applied the same techniques to diagnosing heart disease. The proposed model aimed to answer questions such as: Do these single and hybrid strategies assist healthcare practitioners in finding appropriate therapy for individuals with heart disease? And, in terms of precision, which one is superior? Vikas and Saurabh [7] worked on predicting the presence of heart disease accurately with a reduced number of attributes. The original thirteen attributes were reduced to 11, and different classifier techniques (Naïve Bayes, J48 Decision Tree, and Bagging) were used to predict the diagnosis of heart disease without losing efficacy or precision. In that study, 10-fold cross-validation was used to obtain an unbiased estimate of the prediction models. The results show that the bagging algorithm achieved an accuracy of 85.03%. The authors in [8] implemented a hybrid system for heart disease using data on the risk variables of 50 individuals. Two approaches were used in the study: artificial neural networks and evolutionary computation. The global optimization advantage of the genetic algorithm was applied to initialize the neural network weights. In comparison to the back-propagation algorithm, they found that learning was more consistent, rapid, and precise. The final result indicated that the training accuracy was 96.2 percent, while the validation accuracy was 89 percent, as shown in Table IV. In [4], Vijayarani sought to determine which of two classification algorithms (SVM or the Naïve Bayes classifier) was better based on two factors: classification accuracy and processing time.
According to that paper, the SVM classifier performed better than the Naïve Bayes classifier in terms of accuracy. However, the Naïve Bayes classifier classified the data faster, with the minimum execution time. Umair et al. [9] attempted to extract useful patterns from heart patients' information. They used the Weka machine learning software with three classification algorithms: Decision Tree, Naïve Bayes, and Neural Network. Data on 597 heart patients were obtained online from the UCI repository. Several performance metrics were considered: accuracy, TP rate, FP rate, precision, F-measure, and ROC curve value. To increase accuracy and decrease execution time and complexity, two scenarios were presented: the first scenario used 14 attributes, while the second used 8 selected attributes. The data set was in the ARFF format supported by Weka. They concluded that the Naïve Bayes classification algorithm had the highest accuracy, reaching approximately 82.914%. Working with more than 1000 patient records, [10] proposed a strategy for making diagnostic choices and determining the associated risks of clinical populations at an early stage. To identify the risk level of patients, frequent item sets were extracted depending on the indicated symptoms and the minimum support value. The prediction results indicated that the frequent item set generation method is better than existing methods. In addition, Shaila and Anupamma [11] adopted the Decision Tree technique for diagnosing Chronic Obstructive Pulmonary Disease (COPD) in India. The proposed model was presented for the Indian e-health system, which contains patients' details. The "aadhaar" number, a unique identifier issued by the government of India as identification proof, was used as a reference to follow the treatments given to each patient in different hospitals.
The authors in [12] aimed to evaluate the accuracy of several data mining classification algorithms. The Cleveland heart disease data collection, which has 303 instances, was used as the core database for training and testing the proposed system. Ten-fold cross-validation was used to make the most of the available data with the following classifiers: Naïve Bayes (NB), Decision Tree (DT), K-Nearest Neighbor (K-NN), Multilayer Perceptron (MLP), Radial Basis Function (RBF), Support Vector Machine (SVM), and Single Conjunctive Rule Learner (SCRL). Moreover, bagging, boosting, and stacking were used as ensemble prediction methods. The results showed that SVM had the highest accuracy with 84.15%, while SCRL had the lowest accuracy with 69.96%. After the bagging technique was applied, SVM remained the most accurate, while DT was the worst with 78.54%. With the boosting technique, SVM again achieved the highest accuracy, 84.81%. Finally, the stacking technique confirmed that MLP and SVM were the most accurate, with an accuracy of 84.15 percent. In [13], application software was produced for medical experts to predict the occurrence or recurrence of non-communicable diseases (NCDs) based on a predictive data mining model. To examine the proposed software application, patient records were obtained from the Bahrain Defense Force hospital. The application was executed and tested in that hospital by practicing physicians. It showed that the prediction system can be adopted for predicting NCDs instantly and effectively. Furthermore, in [14], Sayali and Rashmi used both the Naïve Bayes and the KNN algorithms to predict cardiac illnesses. The results showed that the NB accuracy was 82%, higher than that of KNN. In addition, as an extension of their work, the authors proposed a method for illness risk prediction using a deep convolutional network-based unimodal diabetes risk prediction model (CNN-UDRP).
Using structured data, the CNN-UDRP algorithm achieves a predictive performance above 65 percent. The authors in [15] addressed several objectives: they analyzed different data mining techniques used for classification and different classification methods for imbalanced data. Furthermore, they designed a new classification method for the prediction of heart disease. In addition, the proposed classification method was analyzed on different data sets. Finally, the existing classification techniques were compared with the proposed prediction method in terms of performance and AUC. Pranav et al. [16] used five machine learning algorithms to predict the possibility of having heart disease (Random Forest, Support Vector Machine, Naïve Bayes, Logistic Model Tree (LMT), and Hoeffding Decision Tree). The Cleveland dataset was used for training and testing the model. Each algorithm was analyzed and the results were compared by means of accuracy. The authors found that Random Forest was the best among the algorithms because it gave the highest accuracy.

Methodology
The main goal of this research is to prepare a patient dataset with which possible heart attacks can be predicted effectively. To identify the relation between cardiovascular disease and specific attributes, a model was constructed using prediction techniques. Data mining is used in the study to create models based on the selected class prediction attributes. Because of its capabilities in pattern identification, analysis, and prediction [17], WEKA (Waikato Environment for Knowledge Analysis) was used to make predictions. WEKA is an open-source machine learning toolkit developed by the University of Waikato, New Zealand, for knowledge analysis. Its components are data pre-processing methods, classification and regression algorithms, clustering algorithms, association rule mining algorithms, and 15 attribute/subset evaluators plus 10 search algorithms for feature selection. The key GUIs are "Explorer" (exploratory data analysis), "Experimenter" (experimental environment), and "Knowledge Flow" (an interface influenced by the modern process model).
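WEKA conventionally reads datasets in its ARFF text format. The source does not state which file format the authors used, so the following is only a minimal sketch of how a patient table could be serialized for WEKA. The function name `to_arff`, the relation name, and all attribute names and values are hypothetical and are not taken from the study's dataset.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text.

    attributes: list of (name, kind) pairs, where kind is the string
    "numeric" or a list of allowed nominal values.
    rows: list of records, each a list of values in attribute order.
    """
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        if kind == "numeric":
            lines.append("@attribute " + name + " numeric")
        else:
            # Nominal attributes list their allowed values in braces.
            lines.append("@attribute " + name + " {" + ",".join(kind) + "}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical attributes and one hypothetical record, for illustration only.
example = to_arff(
    "heart",
    [("heart_rate", "numeric"),
     ("smoking", ["no", "yes"]),
     ("class", ["normal", "angina"])],
    [[82, "yes", "angina"]],
)
print(example)
```

A file written this way can be opened in the WEKA Explorer, with the last attribute typically selected as the class.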
The standard dataset compiled in this analysis includes 200 cases, obtained from Iraqi hospitals under the supervision of the Ministry of National Health. Many different observations must be examined during diagnosis to identify a heart attack with high precision; typically, doctors rely on all the recorded symptoms, the answers the patient gives to questions, and the physical examination and laboratory tests. The UCI data provide ample medical attributes for identifying heart disease, so the data obtained from Baghdad hospitals were based on these attributes. Medical attributes such as exercise-induced angina, ST depression induced by exercise relative to rest, the slope of the peak exercise ST segment, and the number of major vessels colored by fluoroscopy will be taken into account. As a result, cardiologists use these parameters instead of medical factors (heart rate, family history of health, smoking, echo perkiness detection, and prior angina attack). In addition to the recorded ultrasound, modified medical conditions, referred to as risk factors, such as family history and the likelihood of previous angina, are applied to obtain a sufficient set of medical factors; the outcomes include four forms of angina in addition to a normal class. Table 1 shows the distribution of the 5 types of heart disease; the dataset consists of 13 clinical attributes essential for diagnosing heart disease. These variables are converted into a numerical representation to construct a diagnostic system [18].
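The conversion of clinical variables into a numerical representation can be sketched as follows. This is an illustration under stated assumptions: the attribute names (`heart_rate`, `smoking`, `family_history`) and the category codes are hypothetical, since the study's actual coding scheme is not given in the text.

```python
# Hypothetical category-to-code mappings; the study's real coding is not stated.
SMOKING_CODES = {"no": 0, "yes": 1}
FAMILY_HISTORY_CODES = {"no": 0, "yes": 1}


def encode_record(record):
    """Turn one patient record (a dict) into a list of numbers."""
    return [
        float(record["heart_rate"]),                       # already numeric
        float(SMOKING_CODES[record["smoking"]]),           # nominal -> code
        float(FAMILY_HISTORY_CODES[record["family_history"]]),
    ]


# Hypothetical patient, for illustration only.
patient = {"heart_rate": 82, "smoking": "yes", "family_history": "no"}
print(encode_record(patient))
```

An unknown category raises a `KeyError` here, which makes coding mistakes visible instead of silently producing a wrong number.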

Key performance indicators
The indicators used in this study are described as follows:

Precision
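This is the proportion of relevant instances among the retrieved instances, that is, the fraction of cases predicted as positive that are truly positive. The precision equation is as follows [19]:
\[ \text{Precision} = \frac{TP}{TP + FP} \quad (1) \]
where TP and FP are the numbers of true positives and false positives.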
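The precision definition above can be sketched in a few lines of Python. The counts used in the example are hypothetical and are not results from the study; returning 0.0 when no case is predicted positive is a convention chosen here, not something the source specifies.

```python
def precision(tp, fp):
    """Fraction of predicted-positive cases that are truly positive."""
    predicted_positive = tp + fp
    if predicted_positive == 0:
        return 0.0  # assumed convention when nothing is predicted positive
    return tp / predicted_positive


# Hypothetical counts, for illustration only.
print(precision(tp=45, fp=5))
```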

Recall
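Recall is the proportion of actual positive cases that are correctly identified. The recall formula is as follows:
\[ \text{Recall} = \frac{TP}{TP + FN} \quad (2) \]

F-Measure
The F-measure equation is given in (3), from [20]:
\[ F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \quad (3) \]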
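Recall and the F-measure can be sketched in the same way. The code assumes the standard balanced F-measure (the harmonic mean of precision and recall); the counts are hypothetical, and the zero-division conventions are choices made for this sketch.

```python
def recall(tp, fn):
    """Fraction of actual positive cases that are correctly identified."""
    actual_positive = tp + fn
    if actual_positive == 0:
        return 0.0  # assumed convention when there are no positive cases
    return tp / actual_positive


def f_measure(precision_value, recall_value):
    """Harmonic mean of precision and recall."""
    total = precision_value + recall_value
    if total == 0:
        return 0.0  # assumed convention when both inputs are zero
    return 2 * precision_value * recall_value / total


# Hypothetical counts, for illustration only.
r = recall(tp=45, fn=15)
print(r, f_measure(0.9, r))
```

Because it is a harmonic mean, the F-measure stays low whenever either precision or recall is low, which is why it is reported alongside accuracy.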

ROC Area
ROC curves are widely used to display graphically the connection, or trade-off, between clinical sensitivity and specificity for each feasible cut-off of a test or combination of tests. The ROC area is the area under this curve.
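As a minimal sketch of how a ROC curve and its area can be obtained, the code below sweeps the cut-off over every score, records the (false positive rate, true positive rate) pairs, and integrates with the trapezoidal rule. The labels and scores are made up for illustration, and WEKA's own implementation may differ in detail.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points for every cut-off, from strictest to loosest.

    labels: 1 for diseased, 0 for healthy; scores: higher means more suspicious.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative case")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        cutoff = ranked[i][0]
        # Cases tied at the same score cross the cut-off together.
        while i < len(ranked) and ranked[i][0] == cutoff:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up labels and scores, for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(area_under_curve(roc_points(labels, scores)))
```

An area of 1.0 corresponds to perfect separation of the two classes, and 0.5 to a classifier that ranks cases no better than chance.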

PRC Area
Precision-recall curves are not affected by the number of patients who are not ill and have low test results. To obtain the full picture when assessing and comparing tests, precision-recall curves should be used in addition to the commonly used ROC curves. The classifier outcomes [21] are indicated in Table 2.
• True positive (TP): patients correctly identified as positive.
• True negative (TN): patients correctly identified as negative. The system becomes a model example when the TP and TN rates are close to 100 percent.
• TN (true negative): people who are healthy are correctly identified as such.
• FN (false negative): patients with cardiovascular disease are mistakenly classified as healthy [22].
• CCI: the proportion of patients who have been correctly classified, both those who require medical tests and those who do not. This is also known as accuracy [23]:
\[ \text{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \]
where FP denotes the false positives (healthy people classified as ill).
• MAE: a comparison of what was predicted with what actually happened. It can be calculated as 1 - ACC. A good classifier has a relatively small mean absolute error [24].
• Kappa: measures the agreement between the predictions and the true class. It compares the observed agreement with the agreement expected by chance. For a classifier that is no worse than chance, the value of kappa lies between 0 and 1. A value greater than 0 indicates that the classifier is better than chance [25].
• RMSE: the Root Mean Squared Error [26] measures the difference between the estimated and actual values:
\[ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{j=1}^{n} \left( P_j - T_j \right)^2} \]
where P_j is the value predicted for fitness case j, T_j is the target value for fitness case j, and n is the number of cases.
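The accuracy, kappa, MAE, and RMSE indicators can be sketched for a two-class problem as follows. The counts and vectors are hypothetical and are not results from the study. The sketch also assumes hard 0/1 predictions, which is the case in which MAE equals 1 - ACC; WEKA may compute its error measures from class probabilities instead.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Proportion of correctly classified instances (CCI)."""
    return (tp + tn) / (tp + tn + fp + fn)


def cohen_kappa(tp, tn, fp, fn):
    """Agreement between predictions and true class, corrected for chance."""
    n = tp + tn + fp + fn
    observed = (tp + tn) / n
    # Chance agreement from the row and column totals of the confusion matrix.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if expected == 1.0:
        return 0.0  # assumed convention: kappa is undefined in this case
    return (observed - expected) / (1.0 - expected)


def mean_absolute_error(predicted, target):
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


def root_mean_squared_error(predicted, target):
    squared = sum((p - t) ** 2 for p, t in zip(predicted, target))
    return math.sqrt(squared / len(target))


# Hypothetical confusion-matrix counts, for illustration only.
print(accuracy(90, 95, 5, 10), cohen_kappa(90, 95, 5, 10))

# Hypothetical hard 0/1 predictions: one error in four cases.
predicted = [1, 0, 1, 1]
target = [1, 0, 0, 1]
print(mean_absolute_error(predicted, target),
      root_mean_squared_error(predicted, target))
```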

Experimental results and performances comparisons
To predict heart disease, part of the dataset is used for training and the remaining part is used for testing. The results of the classifiers are listed in Table 3. Figure 1 shows the overall classifier visualization. The percentage-split results were assessed using a variety of machine learning methods. The algorithms are applied to the data set using stratified 10-fold cross-validation to test the effectiveness of the classification strategies and to calculate precision. The accuracy, sensitivity, and specificity measures are calculated from the resulting confusion matrix. The matrix shows which samples are designated as true and which are marked as false. An evaluation of the confusion matrix reveals that J48, REPTree, Bayes Net, and Random Forest provide a predictive model of the 200 cases in which the cardiovascular disease risk factor is favorable. The results clearly suggest that different data mining techniques can determine a type of diagnosis.
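The evaluation procedure described above (stratified folds, then measures derived from the confusion matrix) can be sketched in plain Python. This is an illustration and not WEKA's implementation: the round-robin fold assignment, the function names, and the 120/80 class split are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign case indices to k folds so each fold keeps the class mix."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled cases of each class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def confusion_counts(actual, predicted, positive=1):
    """Return (TP, TN, FP, FN) for a two-class problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a != positive and p != positive:
            tn += 1
        elif a != positive and p == positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


# Hypothetical class distribution over 200 cases, for illustration only.
labels = [1] * 120 + [0] * 80
folds = stratified_folds(labels, k=10)
print([len(fold) for fold in folds])
```

In each of the 10 rounds, one fold serves as the test set and the other nine as the training set, and the confusion counts are accumulated over all rounds before the measures are computed.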
To assess the effectiveness of our study, we compared our findings with those of other researchers in the same area, that is, studies applying the same WEKA methods to heart disease but with different parameters. We found that our proposed model is superior to the other models in terms of heart disease prediction and detection, as well as classification accuracy. Table 4 and Figure 2 compare our proposed WEKA classification framework with the results of other studies.

Conclusion
In this research article, we have introduced an efficient heart disease prediction program using data mining, and we use a classifier ensemble to analyze the accuracy of heart disease prediction. The heart data set obtained from Ibn al-Bitar hospital and the machine-learning repository in Baghdad Medical City was used for training and testing purposes. This system will help medical practitioners make effective decisions based on the given parameters. The experiment was successfully conducted with several data mining classification techniques (IBk, J48, Naïve Bayes, and REPTree) with 10-fold cross-validation, and it was found that the J48 algorithm gives the best output, with an accuracy of 94.5 percent on the supplied data set. It is believed that data mining will contribute significantly to cardiac disease research and ultimately improve quality. The approach may also be applied using multiple classification methods.

Acknowledgment
The authors would like to thank Mustansiriyah University (www.uomustansiriyah.edu.iq), Baghdad, Iraq, for its support in the present work.