Improved Lasso (ILASSO) for Gene Selection and Classification in High Dimensional DNA Microarray Data

Gene selection and classification in high-dimensional microarray data has become a challenging problem in molecular biology and genetics. Penalized adaptive likelihood methods have recently been employed for cancer classification because they address gene selection consistency and the estimation of gene coefficients simultaneously. Many studies have proposed ordinary least squares (OLS), maximum likelihood estimation (MLE) or the Elastic net as the initial weight in the Adaptive elastic net, but MLE and OLS are not suitable for high-dimensional microarray data. Likewise, using the Elastic net as the initial weight in the Adaptive elastic net performs poorly, because the ridge penalty in the Elastic net pulls the coefficients of highly correlated genes closer to each other. As a result, the estimator fails to distinguish highly correlated genes whose coefficients have different signs, and such genes are grouped together. To tackle this issue, the present study proposes the Improved LASSO (ILASSO) estimator, which adds a ridge penalty to the original LASSO and applies an adaptive weight to both the $\ell_1$-norm and the $\ell_2$-norm simultaneously. Results on real data indicate that ILASSO performs better than the other methods in terms of the number of genes selected, classification precision, sensitivity and specificity.
Keywords: high-dimensional data, penalized Adaptive Elastic net, ILASSO, logistic regression, gene selection, cancer classification


Introduction
Genetics and molecular biology have recently been transformed by research on gene selection and classification in high-dimensional DNA microarray data. In such data, the number of genes runs into the thousands, while the number of tissue samples is a hundred or fewer [1], [2]. Because of this high dimensionality, gene selection and classification have become challenging issues in biomedical research [3], [4]. Cancer classification and gene selection have been studied extensively in recent years, and penalized logistic regression methods help to identify significant genes that are biologically relevant to a specific cancer [5]-[7].
Several gene selection methods have been suggested for choosing a small, appropriate subset of genes in high-dimensional cancer classification. Penalized likelihood methods that perform model estimation and gene selection efficiently have recently been considered in [8]-[10]. The most popular penalized likelihood method is the least absolute shrinkage and selection operator (LASSO) proposed in [11]. LASSO imposes an $\ell_1$ penalty on the loss function. Because of the properties of the $\ell_1$ penalty, variable selection is performed by shrinking some gene coefficients to exactly zero [12]. For this reason, LASSO has gained recognition for cancer classification in high-dimensional data.
However, despite its advantages, LASSO has some drawbacks. Firstly, when $p > n$, LASSO cannot select more variables than the sample size $n$, owing to the nature of the convex optimization problem [13], [14]. Secondly, LASSO is inconsistent in variable selection: it applies the same amount of penalization to all coefficients, which biases the estimates of large coefficients, so the oracle property does not hold [15]. Lastly, in cancer classification with high-dimensional microarray data, LASSO cannot address the grouping effect [16], [17]. To address some of these shortcomings, Zou and Hastie [18] proposed the penalized Elastic net, which combines the LASSO penalty and the ridge penalty.
However, like LASSO, the Elastic net does not enjoy the oracle property, even though it performs better than LASSO. To handle the grouping effect among highly correlated genes while also enjoying the oracle property in high-dimensional data, Zou and Zhang proposed the Adaptive elastic net [19]. The Adaptive elastic net estimator, however, is sensitive to the initial weight. Many studies have proposed ordinary least squares (OLS), maximum likelihood estimation (MLE) or the Elastic net as the initial weight in the Adaptive elastic net, but MLE and OLS are not suitable for high-dimensional microarray data [20]. Likewise, using the Elastic net as the initial weight in the Adaptive elastic net performs poorly, because the ridge penalty in the Elastic net pulls the coefficients of highly correlated genes closer to each other. As a result, the estimator fails to distinguish highly correlated genes whose coefficients have different signs, and such genes are grouped together [21]. To tackle this issue, the present study proposes the Improved LASSO (ILASSO) estimator, which adds a ridge penalty to the original LASSO. In ILASSO, an adaptive weight is also applied to both the $\ell_1$ and $\ell_2$ penalties simultaneously. The remainder of this paper is organised as follows: Section 1 presents the introduction; Section 2 describes the penalized adaptive regression methods and the proposed technique; Section 3 discusses the results; and Section 4 concludes the study.

Penalized logistic regression
Logistic regression is a statistical method for modelling a binary classification problem. In cancer classification, the response variable takes only two values, 1 for the tumour class and 0 for the normal class. For binary classification, the regression function is nonlinearly related to the linear combination of genes [22]. Let $y_i \in \{0, 1\}$ denote the response variable for observation $i$, $i = 1, 2, 3, \ldots, n$. Assume $p$ genes, with the $j$-th gene vector represented by $x_j = (x_{1,j}, x_{2,j}, x_{3,j}, \ldots, x_{n,j})^T$. Then the linear predictor is given by:
$$\eta_i = \beta_0 + \sum_{j=1}^{p} x_{i,j}\,\beta_j \qquad (1)$$
Given the linear predictor, the logit transformation is carried out as follows:
$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \eta_i \qquad (2)$$
The probability of observation $i$ belonging to the positive class, i.e., malignant, can now be calculated as follows:
$$\pi_i = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} \qquad (3)$$
where $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$ is the unknown vector of gene coefficients. The log-likelihood function is defined below:
$$\ell(\beta) = \sum_{i=1}^{n}\left[y_i \log \pi_i + (1 - y_i)\log(1 - \pi_i)\right] \qquad (4)$$
Cancer data are normally high dimensional, with $p \gg n$. A non-negative penalization term is added to the negative of the log-likelihood in Eqn. (4) to control the number of gene coefficients. Numerous popular penalization terms have been discussed in [11], [18], [23]. The most common is the $\ell_1$ penalty proposed by Tibshirani in [11]. In penalized logistic regression (PLR), the penalty term performs gene selection and classification by constraining the negative log-likelihood function, that is, $-\ell(\beta) + \lambda P(\beta)$, where $\lambda P(\beta)$ penalizes the estimated coefficients. The positive tuning parameter $\lambda$ controls the degree of shrinkage. When $\lambda = 0$, the penalization term vanishes and PLR reduces to logistic regression with maximum likelihood estimation (MLE). The larger the value of $\lambda$, the stronger its effect on the estimated coefficients. The choice of the tuning parameter is therefore an important part of fitting the model.
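As a minimal sketch of the quantities in this section, the following Python snippet evaluates the linear predictor, the class probability, the log-likelihood and an $\ell_1$-penalized objective on a toy dataset. The data, the coefficient values and all function names are illustrative assumptions and are not taken from the study, whose computations were carried out in R.

```python
import math

def linear_predictor(x_row, beta0, beta):
    """eta_i = beta_0 + sum_j x_ij * beta_j."""
    return beta0 + sum(x * b for x, b in zip(x_row, beta))

def probability(eta):
    """Inverse logit, pi_i = exp(eta_i) / (1 + exp(eta_i)), written to avoid overflow."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def log_likelihood(X, y, beta0, beta):
    """Sum over observations of y_i*log(pi_i) + (1 - y_i)*log(1 - pi_i)."""
    total = 0.0
    for x_row, y_i in zip(X, y):
        pi = probability(linear_predictor(x_row, beta0, beta))
        total += y_i * math.log(pi) + (1 - y_i) * math.log(1.0 - pi)
    return total

def l1_penalty(beta):
    """P(beta) = sum_j |beta_j|; the intercept is left unpenalized."""
    return sum(abs(b) for b in beta)

def penalized_objective(X, y, beta0, beta, lam, penalty):
    """Negative log-likelihood plus lam * P(beta)."""
    return -log_likelihood(X, y, beta0, beta) + lam * penalty(beta)

# Toy data (hypothetical): 4 samples, 2 genes.
X = [[0.5, -1.2], [-0.3, 0.8], [1.1, 0.4], [-1.3, 0.0]]
y = [1, 0, 1, 0]
beta0, beta = 0.0, [1.0, -0.5]

unpenalized = penalized_objective(X, y, beta0, beta, 0.0, l1_penalty)
penalized = penalized_objective(X, y, beta0, beta, 0.1, l1_penalty)
print(unpenalized, penalized)
```

With $\lambda = 0$ the objective is simply the negative log-likelihood, and the two values differ by exactly $\lambda \sum_j |\beta_j|$.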
If the interest is in classification, the tuning parameter is usually chosen to minimize the misclassification error by finding the optimal balance between bias and variance, where a decrease in the bias comes with an increase in the variance. Therefore, the objective function for PLR is described as:
$$\hat{\beta}_{PLR} = \arg\min_{\beta}\left[-\ell(\beta) + \lambda P(\beta)\right] \qquad (5)$$
where $\ell(\beta)$ is the log-likelihood in Eqn. (4). It is assumed without loss of generality that the gene vectors are standardized, that is, $\sum_{i=1}^{n} x_{i,j} = 0$ and $n^{-1}\sum_{i=1}^{n} x_{i,j}^2 = 1$ for all $j \in \{1, 2, \ldots, p\}$. One of the most common PLR methods in the literature is LASSO, whose objective function is given as follows:
$$\hat{\beta}_{LASSO} = \arg\min_{\beta}\left[-\ell(\beta) + \lambda \sum_{j=1}^{p} |\beta_j|\right] \qquad (6)$$
Note that in LASSO the tuning parameter $\lambda$ has a strong influence on the estimated gene coefficients: a larger value of $\lambda$ shrinks some coefficients to exactly zero. LASSO can therefore select variables effectively by reducing the coefficients of the others to exactly zero, which makes it useful for feature selection in different scientific areas [24], [25]. Cross validation is normally employed to select the value of $\lambda$ that gives the smallest misclassification error [26]. Although LASSO has the benefit of performing feature selection in high-dimensional data, it has some limitations. In a group of highly correlated variables, LASSO tends to choose only one variable from the group. Also, LASSO cannot select more explanatory variables than the sample size $n$, and it does not enjoy the oracle property.
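The way an $\ell_1$ penalty produces exact zeros can be illustrated with the soft-thresholding operator, which is the closed-form minimizer of $\tfrac{1}{2}(\beta - z)^2 + \lambda|\beta|$ for a single coefficient. The sketch below is a simplification: it uses this one-coefficient squared-error problem rather than the full logistic likelihood, and the input values are hypothetical.

```python
def soft_threshold(z, lam):
    """Minimizer of 0.5*(b - z)**2 + lam*|b| over b."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# Hypothetical unpenalized estimates for five genes.
z_values = [2.0, -1.5, 0.8, -0.3, 0.1]

selected = {}
for lam in (0.0, 0.5, 1.0):
    shrunk = [soft_threshold(z, lam) for z in z_values]
    # Number of genes that remain in the model at this value of lam.
    selected[lam] = sum(1 for b in shrunk if b != 0.0)
    print(lam, shrunk, selected[lam])
```

As the tuning parameter grows, more of the small coefficients fall inside the threshold and are set to exactly zero, so fewer genes are selected.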
To tackle these shortcomings of LASSO, another penalized likelihood method, the Elastic net, was proposed in [18], [27]. The Elastic net combines the $\ell_2$ and $\ell_1$ penalties and handles the problem that arises when the $\ell_1$ penalty is used with highly correlated variables. Penalized logistic regression using the Elastic net is modelled as follows:
$$\hat{\beta}_{Elastic} = \arg\min_{\beta}\left[-\ell(\beta) + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2\right] \qquad (7)$$
It can be seen from Eqn. (7) that the Elastic net estimator uses two tuning parameters, i.e., $\lambda_1$ and $\lambda_2$. Usually, cross validation is employed to obtain the optimal value of $\lambda_1$ for a fixed value of $\lambda_2$.
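To make the role of the two penalties concrete, the sketch below evaluates an Elastic net penalty of the form $\lambda_1\sum_j|\beta_j| + \lambda_2\sum_j\beta_j^2$ and the minimizer of the corresponding one-coefficient problem $\tfrac{1}{2}(\beta - z)^2 + \lambda_1|\beta| + \lambda_2\beta^2$. The parameterization with `lam1` and `lam2`, the function names and the numbers are assumptions made for illustration; the one-coefficient problem stands in for the full logistic likelihood.

```python
def soft_threshold(z, lam):
    """Minimizer of 0.5*(b - z)**2 + lam*|b| over b."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def elastic_net_penalty(beta, lam1, lam2):
    """lam1 * sum|b_j| (LASSO part) + lam2 * sum b_j**2 (ridge part)."""
    return lam1 * sum(abs(b) for b in beta) + lam2 * sum(b * b for b in beta)

def elastic_net_univariate(z, lam1, lam2):
    """Minimizer of 0.5*(b - z)**2 + lam1*|b| + lam2*b**2 over b.

    The l1 part thresholds z; the ridge part then rescales the result.
    """
    return soft_threshold(z, lam1) / (1.0 + 2.0 * lam2)

penalty_value = elastic_net_penalty([1.0, -2.0, 0.0], lam1=0.5, lam2=0.25)
lasso_only = elastic_net_univariate(2.0, lam1=0.5, lam2=0.0)
with_ridge = elastic_net_univariate(2.0, lam1=0.5, lam2=0.25)
print(penalty_value, lasso_only, with_ridge)
```

Setting `lam2` to zero recovers the LASSO solution, while a positive `lam2` shrinks the surviving coefficient further without creating additional zeros.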

Penalized Adaptive Logistic Regression
Zou [23] proposed the penalized Adaptive LASSO (ALASSO) to improve the selection consistency of LASSO by introducing a data-driven weight into the $\ell_1$ penalty. The idea in ALASSO is to apply a relatively large amount of shrinkage to the zero coefficients and a small amount to the nonzero coefficients, which can reduce estimation bias and improve the stability of variable selection. Fundamentally, ALASSO is a convex optimization problem with an $\ell_1$ constraint, so the efficient algorithms for solving LASSO can also be employed to solve ALASSO. The penalized Adaptive LASSO is therefore defined as:
$$\hat{\beta}_{ALASSO} = \arg\min_{\beta}\left[-\ell(\beta) + \lambda \sum_{j=1}^{p} w_j |\beta_j|\right] \qquad (8)$$
where $\ell(\beta)$ is the log-likelihood in Eqn. (4) and $w_j$ is the weight for the $j$-th coefficient, computed as $w_j = (|\hat{\beta}_j|)^{-\gamma}$, such that $\hat{\beta}_j$ is the initial value of the estimator and $\gamma > 0$. The estimate converges to the true value at rate $O(n^{-1/2})$ [20]. The Adaptive LASSO has the drawback of not handling groups of highly correlated genes in high-dimensional cancer data. The Elastic net was proposed to address the grouping effect among highly correlated variables, as stated in [28], but despite its effectiveness in that respect, it does not possess the oracle property [20]. Hence, the Adaptive Elastic net was proposed in [19], [29], which is presented mathematically as:
$$\hat{\beta}_{AElastic} = \arg\min_{\beta}\left[-\ell(\beta) + \lambda_1 \sum_{j=1}^{p} w_j |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2\right] \qquad (9)$$
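The adaptive weights $w_j = (|\hat{\beta}_j|)^{-\gamma}$ described in this section can be sketched as follows. The initial estimates, the choice $\gamma = 1$ and the convention of assigning an infinite weight to a zero initial estimate (so that the gene is excluded) are assumptions made for illustration.

```python
import math

def adaptive_weights(initial_beta, gamma=1.0):
    """w_j = |beta_hat_j| ** (-gamma); a zero initial estimate gets weight inf."""
    weights = []
    for b in initial_beta:
        magnitude = abs(b)
        weights.append(math.inf if magnitude == 0 else magnitude ** (-gamma))
    return weights

# Hypothetical initial estimates for four genes.
initial_beta = [2.0, -0.5, 0.0, 0.1]
weights = adaptive_weights(initial_beta, gamma=1.0)

# Each coefficient is penalized by lam * w_j instead of a common lam.
lam = 0.2
per_gene_penalty = [lam * w for w in weights]
print(weights)
print(per_gene_penalty)
```

Genes with a large initial estimate receive a small weight and are shrunk only lightly, whereas genes with an initial estimate near zero are penalized heavily, which is the mechanism behind the reduced bias of the adaptive penalty.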

Proposed method
The penalized Adaptive Elastic net is consistent in variable selection, handles the grouping effect and has the oracle property. However, the initial weight influences the performance of the estimator. To estimate the initial weight for the Adaptive Elastic net, existing works have proposed the use of maximum likelihood estimation (MLE) and ordinary least squares (OLS). MLE and OLS may be appropriate for low-dimensional data, but they perform poorly for cancer classification in high-dimensional data [19], [29], [30]. Equally, the Elastic net is not suitable as the initial weight, because its ridge penalty pulls the coefficients of correlated genes closer to each other. As a result, the estimator fails to distinguish highly correlated genes that have different signs [19], so genes with different features are grouped together. In the present paper, ILASSO is proposed to address this problem by introducing the ridge penalty into the original LASSO estimator. As a result, the estimator can easily identify groups of highly correlated genes with different magnitudes.
The objective function for ILASSO adds the ridge penalty to the LASSO objective and applies the adaptive weight to both the LASSO penalty and the ridge penalty. The tuning parameter is tilted towards the ridge penalty to achieve the grouping effect. Furthermore, the adaptive weights in ILASSO are chosen specifically to take care of the coefficients of highly correlated genes; in other words, the initial weight is generated using the LASSO estimator. For the estimator to maintain the grouping effect of highly correlated genes, the value of this tuning parameter is taken to be greater than 0.5. By applying the LASSO estimator as the initial weight in ILASSO, we attain a shrinkage under which the coefficients of highly correlated genes, especially those with different magnitudes, are pushed to exactly zero because of the higher value of the tuning parameter.
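The sketch below illustrates one penalty that is consistent with the verbal description of ILASSO; it should not be read as the authors' exact formulation. It assumes a penalty of the form $\lambda \sum_j w_j\left[(1 - \alpha)|\beta_j| + \alpha\beta_j^2\right]$, in which a mixing parameter `alpha` greater than 0.5 tilts the penalty towards the ridge term and the weights `w_j` come from an initial LASSO fit. The symbol `alpha`, the functional form, the function names and all numbers are hypothetical.

```python
import math

def lasso_initial_weights(initial_beta, gamma=1.0):
    """Weights from a (hypothetical) initial LASSO fit: w_j = |b_j| ** (-gamma)."""
    return [math.inf if b == 0 else abs(b) ** (-gamma) for b in initial_beta]

def ilasso_style_penalty(beta, weights, lam, alpha):
    """ASSUMED form: lam * sum_j w_j * ((1 - alpha)*|b_j| + alpha*b_j**2).

    A gene with infinite weight must have a zero coefficient; otherwise
    the penalty is infinite.
    """
    total = 0.0
    for b, w in zip(beta, weights):
        if math.isinf(w):
            if b != 0:
                return math.inf
            continue
        total += w * ((1.0 - alpha) * abs(b) + alpha * b * b)
    return lam * total

# Hypothetical initial LASSO estimates; the second gene was dropped by LASSO.
weights = lasso_initial_weights([1.5, 0.0, -0.5])

# Candidate coefficients evaluated with alpha > 0.5 (tilted towards ridge).
value = ilasso_style_penalty([1.2, 0.0, -0.4], weights, lam=0.1, alpha=0.7)
blocked = ilasso_style_penalty([1.2, 0.3, -0.4], weights, lam=0.1, alpha=0.7)
print(value, blocked)
```

Under this assumed form, a gene removed by the initial LASSO fit cannot re-enter the model, while the weighted ridge term continues to act on the genes that remain.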

Discussion of results
The proposed ILASSO method was evaluated on a well-known microarray dataset, the diffuse large B-cell lymphoma (DLBCL) dataset available in [31]. The dataset consists of gene expression values for 77 samples, measured with high-density oligonucleotide microarrays, from the two most prevalent adult lymphoid malignancies: 58 samples of diffuse large B-cell lymphoma and 19 samples of follicular lymphoma (FL).
To compare the proposed ILASSO with the other methods, the DLBCL dataset was divided into training and testing sets. The training set was generated randomly from 70% of the original data, and the remaining 30% was used as the testing set. 10-fold cross validation (CV) was employed to obtain the optimal value of the tuning parameter. The study applied a tuning grid in which the value of the tuning parameter is incremented in steps and the resulting performance metrics are recorded at each iteration. The value that gives the best result is then used to evaluate the estimator. All computations in this paper were carried out using R software.
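The partitioning scheme described above can be sketched in a few lines of Python. The seed, the rounding of the 70% training fraction and the absence of class stratification are assumptions, since the study does not specify them, and the function names are illustrative.

```python
import random

def train_test_split_indices(n_samples, train_fraction, seed):
    """Randomly assign sample indices to a training and a testing set."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = round(train_fraction * n_samples)
    return indices[:n_train], indices[n_train:]

def k_fold_indices(indices, k):
    """Split a list of indices into k folds whose sizes differ by at most one."""
    return [indices[i::k] for i in range(k)]

# 77 samples, as in the DLBCL dataset; 70% for training, 30% for testing.
train_idx, test_idx = train_test_split_indices(77, 0.7, seed=1)

# 10-fold cross validation on the training set only.
folds = k_fold_indices(train_idx, 10)
for fold_number, validation_idx in enumerate(folds):
    fit_idx = [i for i in train_idx if i not in validation_idx]
    # A model would be fitted on fit_idx and scored on validation_idx here.
    print(fold_number, len(fit_idx), len(validation_idx))
```

Repeating the split with different seeds gives the repeated random partitions over which performance metrics can be averaged.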
The performance metrics used in the study are the number of genes selected (#G), the area under the curve (AUC), sensitivity (Sen), specificity (Spe) and informedness (IF). One important factor in high-dimensional DNA microarray data is the number of genes selected by each technique [32], [33], [34]. A method that selects a large number of genes does not yield a simple and interpretable model.
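For reference, the classification metrics named above can be computed as in the following sketch. Informedness is taken here as sensitivity plus specificity minus one, and the AUC is computed as the proportion of positive-negative pairs that are ranked correctly. The confusion-matrix counts and the scores are hypothetical and are not results from the study.

```python
def sensitivity(tp, fn):
    """Proportion of positive-class samples classified correctly."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Proportion of negative-class samples classified correctly."""
    return tn / (tn + fp)

def informedness(sen, spe):
    """Sensitivity + specificity - 1; unaffected by class imbalance."""
    return sen + spe - 1.0

def auc(scores_positive, scores_negative):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for p in scores_positive:
        for n in scores_negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))

# Hypothetical test-set counts.
sen = sensitivity(tp=19, fn=1)
spe = specificity(tn=5, fp=0)
inf_score = informedness(sen, spe)

# Hypothetical predicted probabilities for positive and negative samples.
auc_value = auc([0.9, 0.8, 0.4], [0.5, 0.2])
print(sen, spe, inf_score, auc_value)
```

Because informedness combines the two class-wise rates, a classifier that labels every sample as the majority class scores zero, which is why it is informative for imbalanced data such as a 58 versus 19 split.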
Table 1 shows that the proposed ILASSO method outperforms the other four methods in terms of the number of genes chosen: ILASSO selected 6 genes, compared with 29, 12, 50 and 14 for LASSO, ALASSO, Elastic net and Adaptive Elastic net respectively. For classification accuracy, ILASSO also attains the highest AUC in Table 1, 0.99, which is higher than the values of 0.930, 0.980 and 0.960 attained by the competing methods (LASSO, ALASSO, Adaptive Elastic net and Elastic net). In terms of sensitivity, ILASSO is likewise superior to the other methods, with the highest value of 95%. This indicates that ILASSO outperforms the other methods in classifying patients having DLBCL cancer, doing so with a probability of 0.95. Specificity in Table 1 stands for the probability that a penalized technique correctly classifies normal patients. In terms of specificity, ILASSO and ALASSO perform better, with a probability of 1, than LASSO, Elastic net and Adaptive Elastic net. This might be a result of the new weight added to both the LASSO and ridge penalties.
To further evaluate the performance of the proposed ILASSO, informedness (IF), an alternative measure suited to imbalanced data, was used. Table 1 shows that ILASSO outperforms the other methods with an IF of 0.952. This supports the suitability of the proposed technique for building and evaluating predictive models on the DLBCL cancer dataset.
Figs. 1-3 display the 95% confidence intervals for the mean AUC, sensitivity and specificity on the DLBCL cancer data. It can be clearly seen from Figs. 1-3 that ILASSO emerged as the best classifier, with smaller variance for all the performance metrics. Moreover, ILASSO and ALASSO have similar results, classifying patients having DLBCL cancer with a probability of 1 over the 50 partitions.
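A 95% confidence interval for the mean of a metric over repeated partitions can be sketched as below. The normal approximation with a multiplier of 1.96 is an assumption, since the study does not state how its intervals were computed, and the five AUC values are hypothetical placeholders for the values obtained over the partitions.

```python
import math
import statistics

def mean_confidence_interval(values, z=1.96):
    """Mean and normal-approximation confidence limits: mean +/- z * sd / sqrt(n)."""
    mean = statistics.mean(values)
    half_width = z * statistics.stdev(values) / math.sqrt(len(values))
    return mean, mean - half_width, mean + half_width

# Hypothetical AUC values from five random partitions.
auc_values = [0.96, 0.98, 1.00, 0.98, 0.98]
mean_auc, lower, upper = mean_confidence_interval(auc_values)
print(mean_auc, lower, upper)
```

A narrower interval reflects smaller variability of the metric across partitions, which is the sense in which a classifier is described as having smaller variance.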

Conclusion
Cancer classification is one of the most significant areas of research in high-dimensional DNA microarray data. However, various computational methods fail to select and classify small subsets of significant genes because of the high dimensionality of cancer data. To tackle this problem, the current study proposed the ILASSO penalized logistic regression method for gene selection and classification in high-dimensional data. The results of the study confirm the superiority of the proposed ILASSO method in terms of the number of genes selected, classification precision, sensitivity and specificity.
Overall, the results indicated that ILASSO is capable of handling and analysing high dimensional data correctly. Furthermore, the proposed ILASSO results can be useful