Gene Microarray Cancer Classification using Correlation Based Feature Selection Algorithm and Rules Classifiers

Abstract — Gene microarray classification problems are considered a challenging task, since the datasets contain a small number of samples with a high number of genes (features). Gene subset selection in microarray data plays an important role in minimizing the computational load and solving classification problems. In this paper, the Correlation-based Feature Selection (CFS) algorithm is utilized in the feature selection process to reduce the dimensionality of the data and find a set of discriminatory genes. Then, the Decision Table, JRip, and OneR classifiers are employed for the classification process. The proposed approach of gene selection and classification is tested on 11 microarray datasets, and the performance on the filtered datasets is compared with that on the original datasets. The experimental results showed that CFS can effectively screen out irrelevant, redundant, and noisy features. In addition, the results for all datasets proved that the proposed approach can achieve high prediction accuracy and fast computational speed with a small number of genes. Considering the average accuracy over all the analyzed microarray data, JRip achieved the best result compared to the Decision Table and OneR classifiers. The proposed approach has a remarkable impact on classification accuracy, especially when the data is complicated, with multiple classes and a high number of genes.


Introduction
Cancer is considered one of the most dreadful diseases, and diagnosing cancer at an early stage is very important for its proper treatment [11]. Cancer data is a collection of thousands of genes, and DNA microarrays are used to determine the expression level of genes [21]. Microarray gene selection and classification is considered a very challenging task, since the datasets are large and abundant in noisy genes [12]. The main problem in microarray datasets arises from the fact that the genes greatly outnumber the sample observations [27]. Thus, feature selection methods are needed for microarray cancer datasets so as to select a suitable feature set that makes the classifier more accurate and faster [7]. The main benefit of using feature selection is to improve the performance of the classifier while reducing the computational load.

Background
Nowadays, various kinds of machine learning and statistical approaches are used to classify tumour cells accurately, such as support vector machines [15], k-nearest neighbor [30], and neural network techniques [24]. Several researchers also hybridize classification techniques with optimization algorithms for further enhancement of accuracy [1] [22].
A two-phase hybrid model was suggested in [23] for cancer classification, integrating Correlation-based Feature Selection (CFS) with improved Binary Particle Swarm Optimization (iBPSO), with Naive Bayes as the sole classifier.
The Pearson Correlation Coefficient (PCC) [18] was used in combination with Binary Particle Swarm Optimization (BPSO) or a Genetic Algorithm (GA), along with various classifiers, for the selection and classification of high-dimensional microarray data. It was noticed that the PCC filter showed a remarkable improvement in classification accuracy when combined with BPSO or GA. Also, the results showed that BPSO works faster and performs better than GA.
Additionally, the Probabilistic Attribute-Value for Class Distinction (Pavicd) algorithm was used in [4] for feature selection in microarray cancer datasets. The Pavicd algorithm works on the space of feature values instead of the space of features. Experiments show that Pavicd achieves the best performance in terms of running time and classification accuracy when using Ripper-k and C4.5 as classifiers.
A Collaborative Representation (CR)-based classification with regularized least squares was developed in [31] to classify gene data. The CR approach codes a testing sample as a sparse linear combination of all training samples and then classifies the testing sample by evaluating which class leads to the minimum representation error. Experimental results on several diseases show that the CR-based algorithm achieves higher classification accuracy and faster computational speed than traditional classifiers, such as the support vector machine.
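As a concrete illustration of this scheme, the following is a minimal pure-Python sketch of CR-style classification with regularized least squares. The tiny Gaussian-elimination solver, the `lam` regularization value, and the toy data are illustrative assumptions, not details taken from [31].

```python
def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(M)
    A = [row[:] + [b[i]] for i, row in enumerate(M)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (A[r][n] - sum(A[r][c] * x[c] for c in range(r + 1, n))) / A[r][r]
    return x

def cr_classify(train, train_labels, test, lam=0.01):
    """Code `test` as a regularized least-squares combination of all training
    samples, then assign the class whose portion of the code reconstructs
    the sample with the smallest residual (representation error)."""
    n, d = len(train), len(test)
    # normal equations of ridge regression: (A^T A + lam * I) x = A^T y
    gram = [[sum(train[i][t] * train[j][t] for t in range(d))
             + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    rhs = [sum(train[i][t] * test[t] for t in range(d)) for i in range(n)]
    x = solve(gram, rhs)
    best_cls, best_res = None, float("inf")
    for cls in set(train_labels):
        # reconstruct the test sample using only this class's coefficients
        recon = [sum(x[i] * train[i][t] for i in range(n) if train_labels[i] == cls)
                 for t in range(d)]
        res = sum((test[t] - recon[t]) ** 2 for t in range(d))
        if res < best_res:
            best_cls, best_res = cls, res
    return best_cls
```

On a toy two-class problem, a test sample close to the class-0 training samples is reconstructed almost entirely by their coefficients, so class 0 yields the smaller residual.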
A K-Nearest Neighbor (K-NN) classifier with feature selection using the ANOVA test was developed based on the MapReduce programming model [26]. The approach works in a distributed manner on scalable clusters. The algorithms were successfully implemented on the Hadoop framework, and a comparative analysis was performed using various microarray datasets.
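The ANOVA filter used in that approach can be sketched (without the MapReduce distribution) as a per-feature one-way F-statistic ranking; the function names and the top-`m` cut-off below are illustrative assumptions, not details from [26].

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a single feature whose values are
    grouped by class label."""
    values = [v for g in groups for v in g]
    n, k = len(values), len(groups)
    grand = sum(values) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    if ss_within == 0:
        return float("inf")   # feature separates the classes perfectly
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def top_features_by_f(samples, labels, m):
    """Rank features by their F statistic and keep the m highest-scoring indices."""
    classes = sorted(set(labels))
    scored = []
    for j in range(len(samples[0])):
        groups = [[s[j] for s, y in zip(samples, labels) if y == c] for c in classes]
        scored.append((anova_f(groups), j))
    return [j for _, j in sorted(scored, reverse=True)[:m]]
```

A feature whose per-class means differ strongly relative to its within-class spread gets a large F value and is ranked first.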
A multi-test decision tree (MTDT) was applied to solving biological problems [10]. The approach applies several univariate tests in each non-terminal node of the decision tree. Comparison with eight classifiers shows that MTDT achieves statistically significantly higher accuracy than popular decision tree classifiers and is highly competitive with ensemble learning algorithms.
Moreover, the Support Vector Machine (SVM) classifier was applied to four microarray datasets [2]. The study analyzed two different SVM kernels: radial and linear. The results showed that SVM exceeded K-nearest neighbor (KNN) and neural network (NN) classifiers in performance and accuracy.

The Proposed Work
In this paper, 11 different high-dimensional datasets are used. Correlation-based Feature Selection (CFS) with the Greedy Stepwise search method is proposed for gene selection. In addition, multiple classifiers are utilized to compare their performance.

Datasets
The performance of three classifiers (Decision Table, JRip, and OneR) is investigated using eleven (11) publicly available microarray datasets [37]. A brief overview of these datasets, including the number of genes, instances, and classes, is summarized in Table 1.

Correlation-based feature selection algorithm
A common procedure for choosing the most relevant features in a dataset is to use correlation, more formally known as Pearson's correlation coefficient in statistics. We can compute the correlation between each feature and the output variable and select only those features with a moderate-to-high positive or negative correlation (close to -1 or 1), dropping those features with a low correlation (value close to zero).
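A minimal sketch of this correlation filter, assuming numeric features and numeric class labels; the default threshold of 0.5 and the function names are illustrative, not values from the paper.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def filter_by_correlation(samples, labels, threshold=0.5):
    """Keep the indices of features whose |r| with the class label reaches
    the threshold; uncorrelated or constant features are dropped."""
    selected = []
    for j in range(len(samples[0])):
        column = [s[j] for s in samples]
        if abs(pearson(column, labels)) >= threshold:
            selected.append(j)
    return selected
```

Each feature is scored independently of the others; the CFS subset evaluation described next goes one step further by also penalizing features that are correlated with each other.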
Correlation-based feature selection (CFS) ranks features according to a heuristic evaluation function based on correlations [17]. The function evaluates subsets of feature vectors that are correlated with the class label but independent of each other.
The CFS method assumes that irrelevant features show a low correlation with the class and therefore should be ignored by the algorithm. On the other hand, redundant features should be screened out, as they are typically strongly correlated with one or more of the other features [36].
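The standard CFS heuristic scores a subset S of k features as Merit(S) = k·r̄cf / sqrt(k + k(k-1)·r̄ff), where r̄cf is the mean feature-class correlation and r̄ff the mean feature-feature correlation [17]. The sketch below pairs this merit function with a greedy stepwise (forward) search, as used in this paper; the correlation inputs are assumed precomputed, and all names and toy values are illustrative.

```python
import math

def cfs_merit(subset, fc, ff):
    """CFS merit of a feature subset.

    fc[j]      -- |correlation| between feature j and the class
    ff[(i, j)] -- |correlation| between features i and j (keys with i < j)
    """
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(fc[j] for j in subset) / k
    pairs = [(min(a, b), max(a, b))
             for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = sum(ff[p] for p in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

def greedy_stepwise(n_features, fc, ff):
    """Greedy forward search: keep adding the feature that most improves
    the merit; stop when no remaining feature helps."""
    selected, best = [], 0.0
    while True:
        candidate, cand_merit = None, best
        for j in range(n_features):
            if j in selected:
                continue
            m = cfs_merit(selected + [j], fc, ff)
            if m > cand_merit:
                candidate, cand_merit = j, m
        if candidate is None:
            return selected
        selected.append(candidate)
        best = cand_merit
```

In the test below, feature 1 predicts the class well but is highly correlated with feature 0, so the search keeps feature 0 alone — exactly the redundancy screening described above.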

Classification model
In this paper, three classifiers are applied: Decision Table, JRip, and OneR. Multiple classifiers are used because no single classifier works perfectly for all datasets, and different classifiers do not behave the same way on the same dataset.
• Decision Table constructs a decision table majority classifier [25]. It evaluates feature subsets using best-first search and can use cross-validation for evaluation.
• JRip implements RIPPER, including heuristic global optimization of the rule set [35].
• OneR is the 1R classifier with one parameter: the minimum bucket size for discretization [20].
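Of the three, OneR is simple enough to sketch in a few lines: for each feature, map each of its (discretized) values to the majority class, and keep the single feature whose rule makes the fewest training errors. The version below assumes already-discrete feature values and omits the bucket-size parameter; the names and toy data are illustrative.

```python
from collections import Counter, defaultdict

def one_r(samples, labels):
    """Train a 1R rule: pick the single feature whose value->majority-class
    mapping makes the fewest errors on the training data."""
    best = None  # (errors, feature_index, rule)
    for j in range(len(samples[0])):
        buckets = defaultdict(Counter)       # feature value -> class counts
        for s, y in zip(samples, labels):
            buckets[s[j]][y] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in buckets.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in buckets.values())
        if best is None or errors < best[0]:
            best = (errors, j, rule)
    _, j, rule = best
    return j, rule

def one_r_predict(model, sample):
    """Apply a trained 1R rule to one sample."""
    j, rule = model
    return rule.get(sample[j])
```

On a toy weather-style dataset where the first feature perfectly predicts the class, OneR selects that feature and ignores the rest.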

Experimental Design and Results Discussion
Initially, the Decision Table, JRip, and OneR classifiers were applied to the original datasets. Then, all eleven datasets were filtered using the Correlation-based Feature Selection (CFS) algorithm, and the filtered datasets were tested with the same classifiers. This was done in order to compare the classification accuracy after filtration with that obtained before it. For each dataset, experiments were performed using the full training method and 2-fold to 10-fold cross-validation.
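The evaluation protocol above can be sketched as a generic k-fold cross-validation loop; `train_fn` and `predict_fn` are hypothetical stand-ins for any of the three classifiers, and the shuffling seed is illustrative.

```python
import random

def k_fold_accuracy(samples, labels, train_fn, predict_fn, k=10, seed=0):
    """Estimate accuracy with k-fold cross-validation: train on k-1 folds,
    test on the held-out fold, and average the per-fold accuracies."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)          # fixed seed for repeatability
    folds = [idx[i::k] for i in range(k)]     # k roughly equal folds
    scores = []
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in idx if i not in held_out]
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, samples[i]) == labels[i] for i in fold)
        scores.append(correct / len(fold))
    return sum(scores) / len(scores)
```

A majority-class baseline, for instance, plugs in as `train_fn=lambda X, y: max(set(y), key=y.count)` and `predict_fn=lambda m, s: m`.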
The features of the datasets were filtered, and the number of genes selected using CFS is tabulated in Table 2. The accuracy of the classifiers applied to the original and filtered datasets was evaluated, as shown in Table 3. Results in bold indicate the best-performing classifier for each specific dataset. The results show that, in general, the classifiers achieved better accuracy on the filtered datasets than when applied directly to the original datasets. However, in a few cases the accuracy of a classifier on the original dataset is the same as on the filtered dataset.
In addition, the results show high accuracies on individual datasets, such as SRBCT (97.6% with JRip).

Considering the dataset analysis using the full training method and 2-fold to 10-fold cross-validation, the results showed that JRip exceeded Decision Table and OneR in performance and accuracy. As an example, Figure 1 shows the classification accuracy for the SRBCT dataset using the three classifiers in the ten different tests. The results in Figure 1 prove that JRip combined with CFS was the best method, since it yielded better results than the other methods. The average accuracies over all 11 datasets are given in Table 4: 78.2% (Decision Table), 82.4% (Decision Table-CFS), 80.7% (JRip), 86.0% (JRip-CFS), 73.1% (OneR), and 75.3% (OneR-CFS). It is clear that the classifiers' accuracy improved after the selection process using CFS. We also clearly noticed that the average accuracy of JRip was better than that of Decision Table and OneR.

Conclusion
Usually, microarray data is characterized by noisiness as well as high dimensionality. Therefore, selecting relevant genes is imperative in microarray data analysis. In this paper, CFS is proposed to select the relevant features, and the Decision Table, JRip, and OneR classifiers are proposed to classify the microarray data. The comparative analysis proved that the accuracy of all classifiers improved on the filtered datasets compared with their accuracy on the original datasets. This indicates that feature selection by CFS not only improved the efficiency of the classification process but also enhanced its accuracy. Furthermore, JRip presented the highest classification accuracy among all the classifiers. This work can be extended by considering the applicability of other feature selection techniques, such as Genetic Algorithms, Principal Component Analysis, Simulated Annealing, Ant Colony Optimization, and Particle Swarm Optimization.

Authors
Mohammad Subhi Al-Batah obtained his PhD in Computer Science/Artificial Intelligence from Universiti Sains Malaysia in 2009. He is currently lecturing at the Faculty of Sciences and Information Technology, Jadara University, Jordan. In addition, since 2018 he has been working as the Director of the Academic Development and Quality Assurance Center at Jadara University. His research interests include image processing, artificial intelligence, real-time classification, and software engineering (albatah@jadara.edu.jo).

Saleh Ali K. Alomari obtained his MSc and PhD in Computer Science from Universiti Sains Malaysia (USM), Pulau Penang, Malaysia, in 2008 and 2013 respectively. He is a lecturer at the Faculty of Science and Information Technology, Jadara Univer-