A Hybrid Gene Selection Strategy Based on Fisher and Ant Colony Optimization Algorithm for Breast Cancer Classification

Abstract: Breast cancer poses a major threat to human life, especially to women's lives. Despite the progress made in data mining technology in recent years, the ability to predict and diagnose such fatal diseases based on gene expression data still shows limited prediction performance, which may not be surprising since most of the genes in expression data are believed to be irrelevant or redundant. Dimensionality reduction may be considered a crucial step in analyzing gene expression data, as it can reduce the high dimensionality of breast cancer datasets, which may result in better prediction performance for such diseases. The paper suggests a new hybrid gene selection approach that combines the filter method and the Ant Colony Optimization algorithm to find the smallest subset of informative genes (gene markers) among 24,481 genes. The proposed approach uses four machine learning algorithms (C5.0 Decision Tree, Support Vector Machines, K-Nearest Neighbors algorithm, and Random Forest Classifier) to classify each of the selected samples (patients) into two classes: with cancer or without. Compared with existing methods in the literature, experimental results indicate that our proposed gene selection approach achieved globally higher classification accuracies with a relatively smaller number of


Introduction
According to the WHO, breast cancer is ranked second on the list of cancer-related deaths in women after lung cancer, affecting around two million women each year [1]. Diagnosing breast cancer at an early stage may allow adequate and effective treatment to be adopted, which may increase the survival rate for this disease. This highlights a strong need to develop a prediction system that can detect breast cancer at an early stage so that treatment can start promptly. With the development of microarray technology, gene expression analysis has become an effective tool in biomedical research since it enables the expression levels of thousands of genes to be evaluated simultaneously, which has attracted the interest of a number of researchers in the prediction and diagnosis of different kinds of cancers [2]. However, using microarray technology to predict breast cancer is not without challenges, because the large number of genes relative to the small number of specimens may negatively influence the credibility of any prediction system. For this reason, and in order to improve breast cancer risk prediction performance, we propose a new gene selection approach that combines two feature selection (FS) methods in a two-step hybrid system. The first step extracts the most informative genes using the Fisher-score based filter method in order to reduce the search space; the second step uses the Ant Colony Optimization (ACO) based wrapper method to select the smallest subset of genes that allows the highest prediction performance. Our proposed hybrid approach is evaluated using four classifiers: C5.0 Decision Tree, Support Vector Machines (SVM), Random Forest (RF), and K-Nearest Neighbors (KNN) algorithm.
In the next section of this work, we briefly review the existing literature. Techniques and tools are described in the third section. The penultimate section is devoted to discussing experimental results. The final section summarizes the contribution of this study.

Existing Literature
With the development of Machine Learning (ML) techniques, several breast cancer prediction approaches are available. In this section, we will present the most recent common techniques carried out in this research area.
Aldryan et al. have developed a breast cancer risk prediction model that combines MBP (Modified Backpropagation with Conjugate Gradient Polak-Ribiere) classifier with Ant Colony Optimization based gene selection. They tested their approach on five public microarray datasets (Breast cancer, Colon Tumor, Leukemia, Ovarian Cancer, and Lung Cancer). For Breast cancer, the best accuracy of 64.12% was achieved by involving 2448 genes (10% of genes). However, the proposed work differs from ours as its result is based on a large number of genes with a low accuracy compared to our proposed approach [3].
Al-Quraishi et al. proposed a new breast cancer prediction approach based on an ensemble of Deep Neural Network (DNN) and Support Vector Machine (SVM), combining the ensemble classifier DNN+SVM with the Correlation-based filter method (FCBF). Based on the holdout evaluation technique (80% training set and 20% test set), experiments achieved an accuracy of 96.11% using 112 genes. This study differs from ours as we used a stratified k-fold cross-validation technique to evaluate the performance of our proposed models [4].
Kumari and Singh have designed a system that can predict breast cancer at an early stage based on the Wisconsin breast cancer database. The proposed system is a combination of FS using Correlation-Based Measures with classification using Linear Regression (LR), SVM, and KNN algorithm. Experimental results show that the best results in terms of accuracy were achieved with the KNN classifier. This research is different from our work as we predict breast cancer risk based on gene expression data [5].
Shen et al. have introduced a deep learning system that detects breast cancer on screening mammograms using an end-to-end training approach. This research focused on sensitivity and specificity as evaluation metrics, while in our study, we used accuracy to evaluate the quality of our models. Moreover, the proposed approach differs from ours as our prediction system is based on gene expression data [6].
Hajiabadi et al. have integrated a new objective function (a combination of three loss functions: Correntropy, Hinge, and Cross-Entropy) into a simple ANN architecture. They used precision, recall, F1-score, and accuracy to evaluate the performance of the proposed objective function. However, the new method was evaluated through experiments on the Wisconsin Breast Cancer Diagnosis (WBCD) dataset, which is not the case for our study [7].
In order to improve breast cancer risk prediction using gene expression data, Hamim et al. have proposed a new two-phase gene selection approach. First, they used the Fisher score-based filter method to reduce the complexity of the search space. Then, in the second phase, they used the C5.0 Decision Tree algorithm to find the smallest subset of genes that predicts breast cancer with high performance. The experimental results have shown that their prediction framework achieved an accuracy of 93.28% by involving only five gene predictors [8].
To diagnose breast cancer at its early stages, Rajamohana et al. have employed several ML algorithms such as random forests, decision trees, KNN, and SVM. The experiments were conducted on the WBCD dataset, and the results show that random forest performs well in predicting breast cancer, with an accuracy of 93.34% [9].

The Proposed Framework
Figure 1 summarizes the main steps of our proposed framework to improve breast cancer risk prediction. Gene selection combines the Fisher-score based filter method with the ACOC5 based wrapper method, in which ACO performs the gene selection and the C5.0 algorithm serves as the fitness function. C5.0, SVM, KNN, and RF are then used to classify the samples using the selected genes.
Frequently, gene-expression data analysis involves a large number of features (genes), p, versus a small number, k, of samples (patients) (k << p). In contrast, in traditional classification research, the number of features is smaller than the number of samples. Given this fact, performing classification using microarray data can be time- and resource-consuming. Feature selection (gene selection in the context of microarray data analysis) is a powerful tool because it allows us to significantly reduce the search space by selecting only relevant and informative features and removing irrelevant and redundant ones [10], which may improve computational speed and prediction accuracy. Various FS approaches exist in the literature; the present paper proposes a combination of two of them, the Fisher-score based filter method and the ACOC5 based wrapper method. The proposed gene selection approach is called Hybrid Fisher ACOC5 (HFACOC5) and is illustrated in the flowchart shown in Figure 2. The overall pseudo-code of our framework is given in Algorithm 1 and discussed in the following subsections.

Filter method using Fisher score:
Because they act independently of the machine learning process, filter-based feature selection methods are faster than wrapper methods; therefore, they are more commonly used when dealing with high-dimensional datasets [11]. Many filter techniques exist [12]-[14]; in the present work, the Fisher score, denoted by F, was applied to select the most relevant features (genes).
As a supervised strategy used in binary prediction problems, the Fisher score favors features (genes) for which the distances between data points from different classes are as large as possible, whereas the distances between data points from the same class are as small as possible [15]. Thus, the gene subset is determined in two steps:
• At the first step, the score of each gene $G_i$ is computed using Equation (1):
$$F(G_i) = \frac{\sum_{k} n_k \left(\mu_i^k - \mu_i\right)^2}{\sum_{k} n_k \left(\sigma_i^k\right)^2} \tag{1}$$
where $\mu_i^k$ and $\sigma_i^k$ are the mean and standard deviation of the k-th class, corresponding to the i-th gene, $n_k$ is the number of samples in the k-th class, and $\mu_i$ denotes the mean of the whole i-th feature in the X matrix.
• At the second step, all genes in the breast cancer dataset are ranked by their importance, and the 100 top-ranked genes with the highest scores are selected as the most informative.
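The two filter steps above can be sketched in a few lines of standard-library Python. This is only an illustrative sketch, not the implementation used in the experiments: the function names `fisher_scores` and `top_genes`, the use of the population variance, and the handling of genes with zero within-class variance are all assumptions made for the example.

```python
from statistics import fmean, pvariance


def fisher_scores(X, y):
    """Return one Fisher score per column (gene) of X, given class labels y."""
    classes = sorted(set(y))
    scores = []
    for i in range(len(X[0])):
        column = [row[i] for row in X]
        overall_mean = fmean(column)
        between, within = 0.0, 0.0
        for c in classes:
            values = [v for v, label in zip(column, y) if label == c]
            n_c = len(values)
            # Numerator term: class size times squared distance of class mean to overall mean.
            between += n_c * (fmean(values) - overall_mean) ** 2
            # Denominator term: class size times class variance (population variance assumed).
            within += n_c * pvariance(values)
        if within > 0:
            scores.append(between / within)
        else:
            # Assumption: a gene that is constant within every class is scored
            # as infinitely good if the class means differ, and 0 otherwise.
            scores.append(float("inf") if between > 0 else 0.0)
    return scores


def top_genes(scores, k=100):
    """Indices of the k highest-scoring genes, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy data: 4 samples, 2 genes. Gene 0 separates the classes, gene 1 does not.
X = [[1.0, 5.0], [1.2, 3.0], [3.0, 4.0], [3.2, 6.0]]
y = [0, 0, 1, 1]
scores = fisher_scores(X, y)
ranking = top_genes(scores, k=1)
```

On the toy data, the gene whose class means are far apart relative to its within-class spread receives the larger score and is ranked first; with the real dataset, `k=100` would reproduce the cut-off described above.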

ACOC5 based wrapper approach:
Whereas filter methods evaluate the goodness of features independently of any classification process, wrapper approaches use the learning algorithm itself to evaluate the importance of feature subsets, which is why they are slower and computationally more intensive [16]. However, wrappers generally yield better FS results than filters in most cases. In the proposed work, we explore the search space using ACO as the search engine. Our motivation for this choice lies in the ability of ACO to efficiently scan the search space to find the optimal gene subset.
Inspired by the food-searching behavior of real-life ants, Ant Colony Optimization is a popular metaheuristic algorithm introduced by Marco Dorigo in the early 1990s [17]. In nature, when a source of nourishment is found, the ants communicate with one another to find the shortest path between the nourishment source and their nest. The communication is done via a special chemical known as pheromone. When ants travel to the source of food, they deposit an amount of pheromone on the path they choose. As the pheromone is a volatile chemical, the more ants deposit pheromone on a path, the more attractive that path becomes, while the other paths become less attractive as their pheromone evaporates; in this way, the optimal path is chosen [18].
As a powerful optimization technique used in many research areas [19], [20], ACO is a promising approach that has been widely employed in FS [19], [21]. In the context of FS, the main idea of ACO is to model the selection problem as the problem of finding the optimal path in a graph, where the nodes are features (genes) and the edges between them represent the choice of the next feature. Thus, searching for the optimal path in the graph is equivalent to finding the optimal feature subset in the feature space. In the context of gene selection using ACO, we can reformulate our approach as follows:
• Step 1: Initialize the parameters of ACO, such as the number of ants $m$, the maximum number of iterations $T_{max}$, the amount of pheromone $\tau$ in the search space, the pheromone evaporation coefficient $0 \le \rho \le 1$, the heuristic desirability $\eta$, and the tunable parameters ($\alpha \ge 0$ decides the relative influence of the pheromone, $\beta \ge 0$ controls the influence of $\eta$, and a constant multiplier $Q$ defines the amount of pheromone that each ant should deposit).
• Step 2: For each iteration $t$, each ant $k$ starts at a randomly selected feature and, to construct a candidate feature subset from it, follows the probabilistic transition rule of Equation (2):
$$P_{ij}^{k}(t) = \begin{cases} \dfrac{[\tau_{ij}(t)]^{\alpha}\,[\eta_{ij}(t)]^{\beta}}{\sum_{l \in N_i^k} [\tau_{il}(t)]^{\alpha}\,[\eta_{il}(t)]^{\beta}} & \text{if } j \in N_i^k \\ 0 & \text{otherwise} \end{cases} \tag{2}$$
where $N_i^k$ denotes the set of features (nodes) that have not been selected yet by ant $k$, $\tau_{ij}(t)$ the amount of pheromone trail on the edge between nodes (features) $i$ and $j$, and $\eta_{ij}(t)$ the heuristic desirability of visiting feature $j$ (selecting feature $j$) when the ant is at feature $i$.
• Step 3: Evaluate each candidate feature subset (constructed by each ant $k$) using the C5.0 classifier (described in the fourth section). The evaluation step is carried out using Equation (3).
where K denotes the number of folds as we used the stratified K-fold cross-validation technique to evaluate the candidate feature subset.
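The subset construction of Step 2 can be illustrated with a short sketch. It assumes the standard ACO transition rule (pheromone raised to the power alpha times heuristic desirability raised to the power beta, normalized over the features not yet selected). The names `transition_probabilities` and `build_subset`, the uniform heuristic matrix, and the fixed subset size are hypothetical choices made for the example; the evaluation of Step 3 is not included.

```python
import random


def transition_probabilities(current, unvisited, tau, eta, alpha=1.0, beta=5.0):
    """Probability of moving from feature `current` to each unvisited feature."""
    weights = {
        j: (tau[current][j] ** alpha) * (eta[current][j] ** beta)
        for j in unvisited
    }
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}


def build_subset(n_features, subset_size, tau, eta, rng, alpha=1.0, beta=5.0):
    """One ant builds a candidate feature subset by repeated probabilistic moves."""
    current = rng.randrange(n_features)  # random starting feature
    subset = [current]
    unvisited = set(range(n_features)) - {current}
    while len(subset) < subset_size:
        probs = transition_probabilities(
            current, sorted(unvisited), tau, eta, alpha, beta
        )
        candidates = list(probs)
        # Roulette-wheel selection according to the transition probabilities.
        current = rng.choices(candidates, weights=[probs[j] for j in candidates])[0]
        subset.append(current)
        unvisited.remove(current)
    return subset


n = 6
tau = [[1.0] * n for _ in range(n)]  # uniform initial pheromone on every edge
eta = [[1.0] * n for _ in range(n)]  # placeholder heuristic (assumption)
probs = transition_probabilities(0, [1, 2, 3], tau, eta)
subset = build_subset(n, 4, tau, eta, random.Random(0))
```

With uniform pheromone and heuristic values, every unvisited feature is equally likely; as pheromone accumulates on edges used by well-performing subsets, the same rule biases later ants toward those features.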

Gene classification
After selecting the subset of the most informative genes, we classify data using the following algorithms: KNN, SVM, C5.0 decision tree, and Random Forest (RF). The prediction results of these classifiers are then used to evaluate the effectiveness of our proposed gene selection approach.
C5.0 decision tree: C5.0 is a popular decision tree classification algorithm developed from C4.5 [22]. Compared to its predecessor, C5.0 owes its reputation to many advantages: its ability to handle different kinds of data, its handling of missing values and outliers, its high speed and high classification performance, especially with high-dimensional datasets, its support for boosting and cross-validation, and its automatic removal of unhelpful features. To obtain better performance, the maximum number of boosting trials is set to 100.
Support vector machine: As a binary classifier, the SVM aims at finding the linear separation (hyperplane) between two classes of observations, with the idea that the wider the margin between them, the more robust the classification [23]. However, in most real classification cases, datasets are not linearly separable, which may necessitate transforming the original space into a new space in which a linear separation is then constructed [24]. To handle the problem of non-linearity in the present work, the Radial Basis Function (RBF) transformation is used, with the gamma value set using the formula 1/n_features (where n_features is the number of input features).
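As an illustration of the RBF transformation, the sketch below evaluates the kernel value exp(-gamma * ||x - z||^2) for two samples, taking gamma as one over the number of features when it is not given. This default is an assumption about how the gamma formula above should be read, and `rbf_kernel` is a hypothetical helper, not part of any SVM library.

```python
import math


def rbf_kernel(x, z, gamma=None):
    """RBF kernel value between two samples of equal length."""
    if len(x) != len(z):
        raise ValueError("samples must have the same number of features")
    if gamma is None:
        gamma = 1.0 / len(x)  # assumed default: 1 / number of features
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


same = rbf_kernel([1.0, 2.0], [1.0, 2.0])  # identical samples
far = rbf_kernel([0.0, 0.0], [1.0, 1.0])   # squared distance 2, gamma 0.5
```

Identical samples give a kernel value of 1, and the value decays toward 0 as samples move apart; gamma controls how fast that decay is.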

K-Nearest Neighbors algorithm:
KNN is one of the simplest supervised ML algorithms used for pattern classification and regression. KNN is recognized as a non-parametric algorithm because it does not fit a fixed mathematical function to predict labels for new observations; instead, the prediction is based on the majority (for classification) or the average (for regression) of the k nearest neighbors of the new observation (with k > 0) in the training set, using "feature similarity" [25]. Similarity is defined by a distance metric between two observations. In the present work, as we have continuous features, we used the Euclidean distance as the distance metric, and the number of neighbors, k, is set to 4.
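A minimal sketch of this prediction rule, with Euclidean distance and k = 4, is given below. Because an even k can produce voting ties in a two-class problem, the sketch breaks ties in favor of the label of the closest tied neighbor; this tie-breaking rule and the function name `knn_predict` are assumptions made for the example, not details taken from the experiments.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=4):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    neighbors = sorted(
        ((math.dist(x, query), label) for x, label in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )[:k]
    votes = Counter(label for _, label in neighbors)
    best = max(votes.values())
    tied = {label for label, count in votes.items() if count == best}
    # Tie-break (assumption): return the tied label that has the closest neighbor.
    for _, label in neighbors:
        if label in tied:
            return label


train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["healthy", "healthy", "healthy", "cancer", "cancer", "cancer"]
near_origin = knn_predict(train_X, train_y, [0.2, 0.2])
near_far_cluster = knn_predict(train_X, train_y, [5.5, 5.5])
```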
Random forest: Random Forest (RF) is a supervised ML algorithm used for pattern classification and regression. In the context of classification, RF is an ensemble of independent tree classifiers. Each tree classifier is constructed using a randomly selected subset of features. New observations are then classified by taking the most popular class (using a majority voting function) among the classes predicted by all the tree predictors in the RF (the average is used in the regression case) [26]. To construct a decision tree classifier, many splitting criteria can be used; the most frequent ones are the Information Gain (IG) and the Gini Index (GI) [27]. In the present paper, we used the GI as the attribute selection measure.
Performance evaluation: To evaluate the efficiency of our HFACOC5 gene selection approach on breast cancer risk prediction, the well-known metrics accuracy and F1-score are considered. Based on the confusion matrix (Table 1), the metrics are calculated using Equations (5) and (6) presented below:
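To make the Gini Index concrete, the sketch below computes the impurity of a set of labels (one minus the sum of squared class proportions) and the size-weighted impurity of a binary split, which is the quantity a tree minimizes when choosing a split. The helper names `gini_index` and `gini_of_split` are hypothetical.

```python
from collections import Counter


def gini_index(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_of_split(left, right):
    """Size-weighted Gini impurity of a binary split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_index(left) + (len(right) / n) * gini_index(right)


pure = gini_index(["cancer"] * 4)
mixed = gini_index(["cancer", "cancer", "healthy", "healthy"])
perfect_split = gini_of_split(["cancer", "cancer"], ["healthy", "healthy"])
```

A pure node has impurity 0, a balanced two-class node has impurity 0.5, and a split that separates the classes perfectly brings the weighted impurity back to 0.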

Table 1. Confusion matrix
Actual positives (patients diagnosed with cancer):
• True Positive (TP): patients diagnosed with cancer whom the system also predicted as having cancer.
• False Negative (FN): patients diagnosed with cancer whom the system predicted as healthy.
Actual negatives (patients diagnosed as healthy):
• False Positive (FP): patients diagnosed as healthy whom the system predicted as having cancer.
• True Negative (TN): patients diagnosed as healthy whom the system also predicted as healthy.
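The two metrics can be computed directly from the four confusion-matrix counts, using the standard definitions accuracy = (TP + TN) / (TP + TN + FP + FN) and F1 = 2TP / (2TP + FP + FN). The following sketch is illustrative only; the function names and the label strings "cancer" and "healthy" are assumptions made for the example.

```python
def confusion_counts(y_true, y_pred, positive="cancer"):
    """Return (TP, FN, FP, TN), treating `positive` as the positive class."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def accuracy(tp, fn, fp, tn):
    """Share of all samples that were predicted correctly."""
    return (tp + tn) / (tp + fn + fp + tn)


def f1_score(tp, fn, fp, tn):
    """Harmonic mean of precision and recall, written in terms of the counts."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


y_true = ["cancer", "cancer", "cancer", "healthy", "healthy", "healthy", "healthy", "cancer"]
y_pred = ["cancer", "cancer", "healthy", "healthy", "healthy", "cancer", "healthy", "cancer"]
counts = confusion_counts(y_true, y_pred)
acc = accuracy(*counts)
f1 = f1_score(*counts)
```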
To further validate the classification performance of our prediction models, another popular metric used for performance comparison, the well-known AUC (area under the ROC curve) is used [28], which is the sum of successive trapezoid areas under the ROC (Receiver Operating Characteristic) curve [29]. The model that gives 100% of correct predictions (TP + TN = 100% of samples) has an AUC of 1, while the model that gives 100% of wrong predictions (FN + FP = 100% of samples) has an AUC of 0.
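The trapezoid computation of the AUC can be sketched as follows. The ROC points are obtained by sweeping the decision threshold over the predicted scores from highest to lowest, and the area is the sum of the trapezoids between consecutive points. The function names and the toy scores are assumptions made for the example, and both classes are assumed to be present in `y_true`.

```python
def roc_points(y_true, scores, positive=1):
    """ROC points (FPR, TPR) obtained by lowering the threshold over the scores."""
    n_pos = sum(1 for t in y_true if t == positive)
    n_neg = len(y_true) - n_pos
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples sharing the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == positive:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Sum of successive trapezoid areas under the ROC curve."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = auc_trapezoid(roc_points(y_true, scores))
perfect = auc_trapezoid(roc_points([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))
inverted = auc_trapezoid(roc_points([1, 1, 0, 0], [0.1, 0.2, 0.8, 0.9]))
```

In the toy example, eight of the nine positive-negative pairs are ranked correctly, so the area is 8/9; a ranking that places every positive above every negative gives 1, and the fully inverted ranking gives 0.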
To gain a better understanding of how our proposed approach behaves, each performance metric described above is computed using the formula in Equation (7).

Dataset source
The dataset was obtained from ELVIRA Biomedical Data Set Repository [30]. The dataset contains 24,481 scanned gene expressions with 97 instances, 51 of which are healthy, and the rest are diagnosed with cancer. The "NaN" symbol in the original data was replaced with the mean. Table 2 summarizes the dataset description.

Data preprocessing
To improve the prediction performance of our models, prior to feeding our gene expression data to any selection or classification process, the expression levels of each gene were standardized using the z-score formula as follows:
$$z = \frac{x - \mu}{\sigma}$$
where x denotes the expression level of the gene, μ is the mean, and σ the standard deviation of that gene.
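A per-gene standardization of this kind can be sketched as below, where each column of the data matrix is one gene. The use of the population standard deviation and the choice to map constant genes to 0 are assumptions made for the example; `zscore_columns` is a hypothetical helper.

```python
from statistics import fmean, pstdev


def zscore_columns(X):
    """Standardize each column (gene) of X to zero mean and unit standard deviation."""
    n_genes = len(X[0])
    columns = [[row[i] for row in X] for i in range(n_genes)]
    stats = [(fmean(col), pstdev(col)) for col in columns]
    return [
        [
            (value - mu) / sigma if sigma > 0 else 0.0  # constant gene: assumed 0
            for value, (mu, sigma) in zip(row, stats)
        ]
        for row in X
    ]


X = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
Z = zscore_columns(X)
first_gene = [row[0] for row in Z]
second_gene = [row[1] for row in Z]
```

After standardization, every non-constant gene has mean 0 and standard deviation 1, so genes measured on different scales contribute comparably to distance-based classifiers such as KNN and SVM.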

Stratified k-fold cross-validation
To avoid the statistical problem of over-estimation due to data partitioning, we used the stratified k-fold cross-validation technique to split our data. Samples were randomly divided into k equal-sized folds, with the same class proportions in all folds. The k-1 partitions (folds) are used to fit the model and the remaining partition is used to test the trained model; thus, we ensure that each class in the dataset has the chance to appear in both the training folds and the testing fold. Moreover, the whole gene selection process was run on the training set to obtain the gene subsets, and the test set was then used to assess the classification accuracy of the obtained gene subsets. The max, min, average, and standard deviation of the classification performance metrics were calculated to evaluate the performance of our gene selection strategy. As the data contains fewer samples than genes, we used stratified 10-fold cross-validation on the whole breast cancer dataset, as it is the most common practice in cross-validation.
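The splitting scheme can be sketched with the standard library alone. The sketch shuffles the indices of each class and deals them round-robin into the folds, so that every fold keeps roughly the class proportions of the whole dataset. The function name `stratified_kfold`, the seed, and the round-robin dealing are assumptions made for the example; the class sizes (51 healthy, 46 with cancer) are those of the dataset described above.

```python
import random
from collections import defaultdict


def stratified_kfold(y, n_splits=10, seed=0):
    """Return a list of (train_indices, test_indices) pairs with class-balanced folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(y):
        by_class[label].append(index)
    folds = [[] for _ in range(n_splits)]
    offset = 0  # keep dealing where the previous class stopped, to even out fold sizes
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[(offset + position) % n_splits].append(index)
        offset += len(indices)
    splits = []
    for f in range(n_splits):
        test = sorted(folds[f])
        train = sorted(i for g in range(n_splits) if g != f for i in folds[g])
        splits.append((train, test))
    return splits


y = ["healthy"] * 51 + ["cancer"] * 46
splits = stratified_kfold(y, n_splits=10)
```

In a pipeline that follows the protocol described above, the Fisher filter and the ACO wrapper would be run inside the loop over `splits`, on the `train` indices only, and the resulting gene subset would then be scored on the held-out `test` indices.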

Experimental settings
Using parallel processing, all experiments of our strategy were implemented in Python 3.7 and tested on a machine with an Intel E5-2637 v2 3.5 GHz processor and 64 GB of RAM, running MS Windows 10.
To achieve better convergence, the parameter settings of our whole prediction system were determined empirically. However, we do not claim that these parameter values are optimal; parameter optimization may be the subject of future research. For example, the ACOC5 algorithm was implemented with 10 ants and a maximum of 100 iterations. The initial pheromone intensity $\tau(t = 0)$ of each edge was set to 1, and the pheromone evaporation value was set to 0.5. The parameters that determine the relative importance of the pheromone ($\alpha$) and of the heuristic information ($\beta$) were set to $\alpha = 1$ and $\beta = 5$, respectively. In the experiments conducted for the performance evaluation of our proposed gene selection strategy, we varied the size of the gene subset between 5 and 20 with an increment of 5 (i.e., 5, 10, 15, and 20).

Results and discussions
In this section, we present the experimental results obtained with the proposed gene selection framework. To measure the performance of the proposed strategy, 10 experiments were conducted using the stratified 10-fold cross-validation evaluation technique (section 4.3). The overall experimental results in terms of classification performance (accuracy, F1-score, and AUC) are reported in Table 3 and Figure 3, including the mean, max, min, and standard deviation (SD) for the four classifiers (SVM, KNN, C5.0, and RF) and for each selected gene subset. According to these results, the search space was reduced twice by our proposed gene selection approach. First, it went from p = 24481 (the original number of genes in the input dataset) to k = 100 genes using the Fisher-score based filter method; then it went from k genes to k′ = 5, 10, 15, or 20 genes using the ACOC5 based wrapper approach. Each obtained subset of k′ genes was used to construct four classifier-based models (HFACOC5-SVM, HFACOC5-KNN, HFACOC5-C5.0, and HFACOC5-RF). As shown in Table 3 and Figure 3, the best performance was achieved by the models based on decision tree classifiers (HFACOC5-C5.0 and HFACOC5-RF), which achieved a performance rate above 91% on all evaluation measures irrespective of the size of the selected gene subset, while the lowest performance was achieved by the models based on the KNN algorithm (HFACOC5-KNN). It can also be noticed that the classification performance slightly decreased as the gene subset size increased, especially for the models based on decision tree classifiers. For example, for the classification model HFACOC5-C5.0, the accuracy decreased from 95.44% (with F1-Score = 0.95 and AUC = 0.96) for five genes to 91.33% (with F1-Score = 0.91 and AUC = 0.91) for 20 genes, which may illustrate the positive impact of the dimensionality reduction process on prediction performance.
The gene accession numbers for each selected gene subset are listed in Table 4.
Because our main aim is to predict the risk of breast cancer with high performance, based on the results shown in Table 3 and Figure 3, the shrinkage model HFACOC5-C5.0 with five genes was deemed to be the best model because it achieved the best prediction performance (accuracy of 95.44%, F1-Score = 0.95, and AUC = 0.96) with the smallest number of involved genes (5 genes). Table 5 and Figure 4 give more details about the experimental performance of our selected prediction model (HFACOC5-C5.0). As can be noticed from Table 5 and Figure 4, this model achieved a maximum classification accuracy of 99-100% in 50% of all experiments (10 folds), and a classification accuracy of 90-95% in 40% of the remaining experiments. Figure 5 also gives a better overview of the performance of our generated models that involve only k′ = 5 gene predictors. As we can notice from this figure, for the shrinkage model HFACOC5-C5.0, the ROC curves of five out of 10 experiments (folds) are almost superimposed on the "perfect performance" curve shown in dotted lines, which supports the choice of this model as the best one generated using our new gene selection strategy.

Conclusion and Future Work
The main purpose of this study was to develop and evaluate a classification model for predicting the risk of breast cancer using gene expression data. A new hybrid gene selection approach (HFACOC5) was proposed to identify small gene subsets able to achieve high prediction performance. The idea of the proposed approach was to take advantage of both filters (Fisher-score) and wrappers (ACOC5): the Fisher score first filters out irrelevant genes to keep the most informative ones, and ACOC5 is then run over the resulting subset (to achieve maximum accuracy and minimum redundancy). In experiments using the stratified 10-fold cross-validation evaluation technique, our proposed strategy, using far fewer genes, achieves high prediction performance in terms of all evaluation measures when it is coupled with decision tree-based classifiers (a maximum accuracy of 99-100% in 50% of all experiments involving five genes). Moreover, as far as we know in the context of our research objective, this is the first time that the data partitioning process using the cross-validation technique was applied before the gene selection approach, which makes our results in terms of selected gene subset and prediction performance more credible than any previous work.
As future work, our proposed approach can be further improved in different respects, such as by considering other bio-inspired algorithms. Also, experimenting on new microarray data would enable us to test the effectiveness of our strategy further.