Improving Penalized Logistic Regression Model with Missing Values in High-Dimensional Data

— Analysis without adequate handling of missing values may lead to inconsistent and biased estimates. Although multiple imputation has become a widely used approach to handling missing data, researchers still routinely encounter missing data in their studies. In high-dimensional data, penalized regression is a popular technique for performing feature selection and coefficient estimation simultaneously. However, one of the most vital issues with high-dimensional data is that it often contains large quantities of missing data, for which common multiple imputation approaches may not work correctly. Therefore, this study uses imputation penalized regression models as an extension of the penalized methods to improve performance and impute missing values in high-dimensional data. The method was applied to real-life high-dimensional datasets with different numbers of features, sample sizes, and missing-data rates to evaluate its efficiency. The method was also compared with other existing imputation penalized methods for high-dimensional data. The comparative experimental results indicate that the proposed method outperforms its competitors by achieving higher sensitivity, specificity, and classification accuracy values.


1 Introduction
Missing data exist in almost all areas of biomedical, epidemiological, and social research. This may be due to various reasons, including unavailability of measurements, survey nonresponse, and data loss [1]. Many statistical techniques require complete cases without any missing data. This matters because inaccurate estimates and conclusions may result from an analysis that does not properly handle missing values [2]. However, the problem of missing data may be addressed using a number of statistical approaches. Jiang et al. [1] argued that ignoring the observations with missing values is a straightforward solution, since when there are only a few observations with missing values, there is usually no significant problem. Deleting a large number of observations with missing values, on the other hand, results in a considerable loss of data [3], [4]. It also has a negative impact on the data's statistical power and efficiency [5]. By filling in the missing values with some reasonable values, imputation produces complete data without eliminating the incomplete cases from the analysis. Some ad hoc methods, including mean substitution, maximum likelihood approaches, single imputation, and multiple imputation (MI), can be used to impute missing data [6]. Therefore, to overcome missing values in high-dimensional data, reliable imputation approaches are required.
High-dimensional data is another issue that is often faced in a wide range of scientific domains, including genetics, health sciences, economics, chemometrics, sociological surveys, environmental sciences, finance, and machine learning, amongst others [7]. In high-dimensional data analysis, variable selection is crucial. Recently, the use of gene selection techniques in biological datasets has risen significantly; there, the number of genes is usually larger than the number of samples [8], which can lead to overfitting and have a detrimental influence on learning. Furthermore, only a few genes have relevant meanings and are directly related to the associated disease from both a biological and a knowledge-discovery standpoint [9]. As a result, identifying informative genes is an efficient way to handle these challenges, and it may be considered a machine learning feature selection problem [10], [11]. In the previous decade, there has been major progress in variable selection methods, among them the penalized methods, which select features and classify them at the same time. Penalized logistic techniques are those that include a penalty term in the logistic regression in order to perform both selection and classification simultaneously. The logistic regression method has attracted a great deal of attention, and a variety of logistic regression models with varying penalties may be utilized. One of these penalties is the Least Absolute Shrinkage and Selection Operator (Lasso), which is based on the L1-norm [12]. Another penalty, based on the L2-norm, is ridge regression [13]. Other penalties are the so-called Smoothly Clipped Absolute Deviation (SCAD) [14], the elastic net [15], the adaptive Lasso method [16], and the adaptive elastic net methods [17], [18].
Consequently, in high-dimensional data, penalized regression is a popular technique for performing variable selection and coefficient estimation simultaneously. However, one of the most vital issues with high-dimensional data is that it often contains large quantities of missing data. According to previous research, most microarray datasets are incomplete to varying degrees, ranging from fifty percent to ninety-five percent [19]. Multiple imputation (MI) [20], [21] has become a widely used approach to handling missing data, with significant improvement in the methods and software [22], [23]. However, MI approaches may not work correctly in high-dimensional data, where the number of variables ($p$) in the imputation model exceeds the sample size ($n$), i.e., $p > n$ or $p \gg n$ [2]. In this case, the problem becomes more critical, and conventional likelihood estimates become unavailable. It also becomes challenging to apply sequential regression imputation in this situation [6], [24].
In the presence of high-dimensional data, it is possible that the current MI methods and software packages may perform inadequately. To address this issue, this study uses imputation penalized regression models as an extension of the penalized methods to improve performance and impute missing values in high-dimensional data. This is done by employing the "one-dimensional weighted Mahalanobis distance" (1-DWM) as an initial weight inside the L1-norm while imputing missing values for each predictor variable (feature). The proposed method, referred to as imputation adaptive penalized logistic regression (IAPLR), is compared with other existing imputation methods for high-dimensional data. The remainder of this article is arranged as follows. Detailed descriptions of the materials and methods are included in Section 2. Section 3 presents and discusses the findings of the experimental investigation designed to assess the effectiveness of IAPLR compared to other penalized approaches. The paper is concluded in Section 4.

2 Materials and methods

2.1 Missing data imputation
Missing data is one of the most prevalent problems in several fields of research. Traditional statistical methodologies demand complete cases without missing data in order to analyze the data. Removing the missing data discards important information, which could lead to inaccurate statistical inference. However, by imputing plausible values for the missing values, imputation provides the whole dataset without removing the incomplete cases from the analysis. Little and Rubin [20], [21] divided missing-data mechanisms into three main categories. The first is missing completely at random (MCAR), in which data are missing independently of both observed and unobserved values. The second is missing at random (MAR), in which the probability of a missing value is determined by the observed values but not by the values that are missing, i.e., $P(M \mid Y_{obs}, Y_{mis}) = P(M \mid Y_{obs})$. The third is missing not at random (MNAR), in which the probability of a missing value also depends on the missing data themselves, i.e., $P(M \mid Y_{obs}, Y_{mis}) \neq P(M \mid Y_{obs})$.
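To make the distinction between these mechanisms concrete, the following sketch generates MCAR and MAR missingness in a toy variable. This is an illustration in Python only (the study's own pipeline is in R), and all variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)   # fully observed covariate
y = rng.normal(size=n)   # variable that will receive missing values

# MCAR: every entry of y is missing with the same probability,
# independent of both x and y.
y_mcar = np.where(rng.random(n) < 0.2, np.nan, y)

# MAR: the probability that y is missing depends only on the observed x
# (larger x -> more likely missing), not on y itself.
p_mar = 1.0 / (1.0 + np.exp(-(x - 1.0)))
y_mar = np.where(rng.random(n) < p_mar, np.nan, y)
```

Under MCAR the missing entries are a random subsample, while under MAR the observed part of `y` is a biased sample unless `x` is taken into account, which is why complete-case analysis can mislead.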
Single imputation approaches create a single specified value for each missing value in a dataset; this has a low computational cost. Researchers have proposed a variety of single imputation strategies. A primary strategy is to examine the other available responses and choose the most plausible value; for example, the value may be calculated using the mean, median, or mode of the variable's available values [3]. In single imputation, imputed values are treated as actual values, so the uncertainty of the imputed values is ignored and the standard errors of subsequent estimates are understated. As a result, the results are biased [25]. Other methodologies, such as machine learning-based methods, may also be utilized for single imputation [26].
In contrast, MI methods use several simulation models to yield multiple values for the imputation of a single missing datum. These approaches introduce variability in the imputed data to obtain a range of reasonable responses. MI approaches are more complicated than single imputation, but they suffer less from biased values. MI can be summarized in three steps. The first step is imputation, in which $m$ independent imputed values matching the missing data are obtained. The second step is analysis, which involves analyzing each of the $m$ imputed datasets using standard statistical techniques for complete data. The third step is pooling, in which the $m$ sets of desired estimates are combined into one set of parameter estimates using Rubin's rules [27]. Several previous studies have provided packages in R to implement MI methods more efficiently. One of these packages is "Multivariate Imputation via Chained Equations" (the "mice" package) [22]. Other packages are "mi" [23] and "Amelia" [28].
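The pooling step can be sketched directly from Rubin's rules: the pooled estimate is the mean of the $m$ completed-data estimates, and the total variance combines the within-imputation and between-imputation variances. The function name `pool_rubin` and the toy numbers below are illustrative only:

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Combine m completed-data estimates via Rubin's rules.

    estimates: length-m point estimates, one per imputed dataset
    variances: the corresponding squared standard errors
    Returns the pooled estimate and total variance W + (1 + 1/m) * B.
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()          # pooled point estimate
    w = variances.mean()              # within-imputation variance
    b = estimates.var(ddof=1)         # between-imputation variance
    t = w + (1 + 1 / m) * b           # total variance
    return q_bar, t

est, var = pool_rubin([1.0, 1.2, 0.9], [0.04, 0.05, 0.04])
```

The `(1 + 1/m)` factor inflates the between-imputation component to account for using a finite number of imputations.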

2.2 The penalized logistic regression model

Logistic regression is a statistical approach for predicting the value of a categorical response variable with just two potential values, represented by 0 and 1. When dealing with low-dimensional data, logistic regression works well. Nevertheless, when dealing with high-dimensional datasets, such as those including gene expression data, it may become inefficient in terms of prediction accuracy and computational efficiency. Another issue that affects the use of logistic regression is overfitting, which occurs when the number of features exceeds the number of observations [29]. Logistic regression with a penalty is used in various classification fields to perform gene selection and classification simultaneously [30]. The model is penalized, and its coefficients are shrunk as part of the regularization procedure [31]. Over the last decade, penalized regression approaches have gained popularity due to their superior prediction accuracy and computational efficiency.
For illustration purposes, suppose a set of data is designed as a matrix $X \in \mathbb{R}^{n \times p}$ ($n \ll p$), $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where each column indicates a feature, each row denotes a sample, $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T$ is the $i$th input sample, the entry $x_{ij}$ denotes the value of the $j$th feature of the $i$th sample, and $y = (y_1, \ldots, y_n)^T$ is the $n$-dimensional vector of binary responses coded as $\{0, 1\}$. The class posterior probability is defined in the logistic regression function as follows:

$$\pi(x_i) = P(y_i = 1 \mid x_i) = \frac{\exp(x_i^T \beta)}{1 + \exp(x_i^T \beta)}, \qquad (1)$$

where $\beta = (\beta_1, \ldots, \beta_p)^T$ is a $p$-dimensional vector of the unknown parameters. Then, the estimator $\hat{\beta}$ is obtained as the minimizer of the negative log-likelihood function:

$$\ell(\beta) = -\sum_{i=1}^{n} \left[ y_i x_i^T \beta - \log\left(1 + \exp(x_i^T \beta)\right) \right]. \qquad (2)$$

Logistic regression is a powerful discriminative tool for classification and variable selection. However, it is not useful as a classification technique when the dataset is high-dimensional, since the design matrix is singular. As a result, it is unable to produce accurate regression coefficient estimates. Moreover, overfitting occurs when datasets are high-dimensional, such as when there are a large number of genes (or features in general), and multicollinearity might affect its estimators [32], [33].
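The negative log-likelihood above can be checked numerically. The sketch below (a Python illustration, not the paper's R code) evaluates it at $\beta = 0$, where every sample contributes $-\log 2$:

```python
import numpy as np

def neg_log_likelihood(beta, X, y):
    """Negative log-likelihood of the logistic model:
    -sum_i [ y_i * x_i'beta - log(1 + exp(x_i'beta)) ]."""
    eta = X @ beta
    # log(1 + exp(eta)) evaluated stably as logaddexp(0, eta)
    return -(y * eta - np.logaddexp(0.0, eta)).sum()

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 0.0, 1.0])
val = neg_log_likelihood(np.zeros(2), X, y)   # 3 * log(2) at beta = 0
```

Using `logaddexp` avoids overflow when $x_i^T \beta$ is large, which matters once coefficients are fitted rather than fixed at zero.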
From a statistical point of view, irrelevant features may introduce noise and reduce classification effectiveness. To increase classification accuracy, statisticians commonly use feature selection methods that can eliminate irrelevant and redundant features. Besides logistic regression, other classification methods are available, such as penalized logistic regression (PLR), which is used to reduce high dimensionality and enhance classification accuracy [34]. Although regularization methods are often applied to high-dimensional data, [35] argued that they may also be applied to low-dimensional data.
In penalized logistic regression, the log-likelihood function is modified by the addition of a positive penalty term, forcing certain coefficients to zero to produce a sparse solution. The PLR penalizes a logistic model with too many features by adding a penalty term to the objective. As a result, when the coefficients are constrained, the coefficients of less essential features become either extremely near to zero or exactly zero. This technique is also known as regularization. The setting for the technique is as follows.
The penalized log-likelihood is represented as:

$$\ell_p(\beta) = \ell(\beta) + \lambda P(\beta), \qquad (3)$$

where $P(\beta)$ indicates a regularization term that can be expressed in different forms and $\lambda > 0$ denotes a control parameter. The PLR of Eq. (3) is then minimized with respect to $\beta$ to obtain estimates of the coefficients. The use of a penalty ensures that each parameter has a unique estimate and results in better predictions than conventional "Maximum Likelihood Estimation" (MLE), with a reasonable balance between bias and variance [36]. Without loss of generality, $y$ and the columns of $X$ are considered to be standardized, $\sum_{i=1}^{n} y_i = 0$ and $\sum_{i=1}^{n} x_{ij} = 0$ for $j = 1, 2, \ldots, p$. Consequently, the intercept term ($\beta_0$) is not penalized. $\beta$ is estimated employing the Lasso technique by:

$$\hat{\beta}_{Lasso} = \arg\min_{\beta} \left\{ \ell(\beta) + \lambda \sum_{j=1}^{p} |\beta_j| \right\}, \qquad (4)$$

where $\lambda$ is a control parameter. When $\lambda = 0$, Eq. (4) reduces to the maximum likelihood estimator; as $\lambda \to \infty$, the penalization forces all coefficients to zero.
The adaptive Lasso (ALasso) method is an extension of Lasso. It was originally proposed by [16] to overcome the shortcomings of Lasso by combining the L1 penalty with a weighted penalty [37]. Zou [16] modified the L1 penalty by assigning different weights to different coefficients in order to make it more efficient. Shrinkage techniques such as ridge, Lasso, and similar methods may be used to obtain the weights. The ALasso associated with logistic regression is given by:

$$\hat{\beta}_{ALasso} = \arg\min_{\beta} \left\{ \ell(\beta) + \lambda \sum_{j=1}^{p} \hat{w}_j |\beta_j| \right\}, \qquad (5)$$

where $\hat{w}_j = 1 / |\hat{\beta}_j|^{\gamma}$, $\gamma \ge 0$, and $\hat{\beta}_j$ is an initial estimate for each $\beta_j$ obtained using the Lasso technique or another shrinkage technique. Here we set $\gamma = 1$ for simplicity.
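The two-stage ALasso idea can be sketched with a simple proximal-gradient (ISTA) solver for the weighted-L1 logistic objective. This is a minimal Python illustration, not the "glmnet"-based R fit used in the paper, and the simulated data and tuning values are assumptions:

```python
import numpy as np

def weighted_l1_logistic(X, y, lam, weights, n_iter=3000, step=0.1):
    """Weighted-L1 penalized logistic regression fitted by proximal
    gradient (ISTA): a gradient step on the log-likelihood followed by
    soft-thresholding coefficient j by step * lam * weights[j]."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        eta = np.clip(X @ beta, -30, 30)        # avoid overflow in exp
        grad = X.T @ (1.0 / (1.0 + np.exp(-eta)) - y) / n
        z = beta - step * grad
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam * weights, 0.0)
    return beta

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = (X[:, 0] - X[:, 1] + 0.3 * rng.normal(size=200) > 0).astype(float)

# Stage 1: a plain Lasso fit (all weights equal) gives the initial estimate.
beta0 = weighted_l1_logistic(X, y, lam=0.01, weights=np.ones(10))
# Stage 2: adaptive weights w_j = 1 / |beta0_j| with gamma = 1.
w = 1.0 / (np.abs(beta0) + 1e-8)
beta_alasso = weighted_l1_logistic(X, y, lam=0.01, weights=w)
```

Features with small initial estimates receive enormous weights and are pushed to exactly zero, while the two informative features survive with the correct signs.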

2.3 The proposed method
Missing data is a problem that affects performance in data analytics, and incorrect imputation of missing values might result in inaccurate predictions. Nowadays, when a vast amount of data is created every second, data usability becomes a key concern for stakeholders, so managing missing data efficiently becomes increasingly crucial. This research is also motivated by the fact that in PLR, the L1-norm penalty may be used to apply the PLR approach to high-dimensional datasets. However, because the L1-norm can be inconsistent for feature selection, this technique may result in the selection of irrelevant and redundant features [38]. To put it another way, PLR estimates based on the L1-norm may be biased for large coefficients, since those receive larger penalties.
Peng et al. [39] employed the "one-dimensional weighted Mahalanobis distance" (1-DWM) as a criterion of gene efficiency, extending the effect of an individual gene to the joint impact of multiple genes; it is defined as:

$$d_j = \frac{|\mu_{j1} - \mu_{j2}|}{\sigma_{wj}}, \qquad (6)$$

where $x_j$ is a column vector denoting feature $j$ across samples, $\mu_{jk}$ is the mean of feature $j$ in class $k$, $\sigma_{wj}^2 = \pi_1 \sigma_{j1}^2 + \pi_2 \sigma_{j2}^2$ denotes the weighted variance of feature $j$, $\sigma_{jk}^2$ denotes the variance of feature $j$ in class $k$, and $\pi_k$ is the prior probability (weight) of class $k$, with $K = 2$ classes in this study and $\pi_1 = \pi_2 = 0.5$.
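Computing this criterion for a single feature is a one-liner per class. The sketch below is a Python illustration assuming the class-mean difference in the numerator (as reconstructed above) and equal class priors:

```python
import numpy as np

def one_d_weighted_distance(xj, labels, pi1=0.5, pi2=0.5):
    """1-DWM-style score for one feature: the absolute difference of the
    two class means divided by the class-prior-weighted standard deviation
    sigma_w, with sigma_w^2 = pi1 * var1 + pi2 * var2."""
    x1, x2 = xj[labels == 0], xj[labels == 1]
    sigma_w = np.sqrt(pi1 * x1.var() + pi2 * x2.var())
    return np.abs(x1.mean() - x2.mean()) / sigma_w

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
# A feature whose class means differ by 2 standard deviations...
informative = np.concatenate([rng.normal(0.0, 1.0, 50),
                              rng.normal(2.0, 1.0, 50)])
# ...versus a pure-noise feature with identical class distributions.
noise = rng.normal(size=100)
d_inf = one_d_weighted_distance(informative, labels)
d_noise = one_d_weighted_distance(noise, labels)
```

Informative features score high and noise features score near zero, which is exactly the ranking the proposed weighting exploits.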
Therefore, this study uses imputation adaptive penalized logistic regression (IAPLR) as an extension of the penalized methods to improve performance and impute missing values in high-dimensional data. This is done by employing the 1-DWM as an initial weight inside the L1-norm while imputing missing values for each feature (gene). The proposed method addresses missing values and improves feature selection in high-dimensional data.
The estimator of the $p$-dimensional coefficient vector is then obtained as:

$$\hat{\beta}_{IAPLR} = \arg\min_{\beta} \left\{ \ell(\beta) + \lambda \sum_{j=1}^{p} w(d_j) |\beta_j| \right\}, \qquad (7)$$

where $w(\cdot)$ is the weight for every feature $j$, computed from the distance $d_j$ of Eq. (6).
The proposed method imputes the missing values using the "naniar" package in R. Moreover, to alleviate inconsistency in feature selection, the proposed weight assigns a relatively large weight to a feature with a low ratio value while giving a small weight to a feature with a high ratio value, as specified by the weight function in Eq. (8). After correctly assigning weights to the features, the IAPLR becomes capable of reliably selecting related features. Figure 1 illustrates the procedure of the proposed method.
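The overall IAPLR-style flow (impute, score features, fit a weighted-L1 logistic model) can be sketched end to end. This Python illustration substitutes a simple column-mean imputation for the paper's R-based imputation and assumes an inverse-distance weight $w(d_j) = 1/d_j$; both are stand-ins, not the authors' exact choices:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 120, 8
X = rng.normal(size=(n, p))
y = np.repeat([0.0, 1.0], n // 2)
X[y == 1, 0] += 2.0                      # only feature 0 separates classes
X[rng.random((n, p)) < 0.2] = np.nan     # seed ~20% missing values

# Step 1: impute each feature (column-mean stand-in for the
# R-based imputation step).
X_imp = np.where(np.isnan(X), np.nanmean(X, axis=0), X)

# Step 2: 1-DWM-style score per feature (pi1 = pi2 = 0.5).
d = np.empty(p)
for j in range(p):
    x1, x2 = X_imp[y == 0, j], X_imp[y == 1, j]
    d[j] = abs(x1.mean() - x2.mean()) / np.sqrt(0.5 * x1.var() + 0.5 * x2.var())

# Step 3: weighted-L1 logistic fit via proximal gradient; discriminative
# features (large d_j) get small weights, hence lighter penalties.
w = 1.0 / (d + 1e-8)                     # assumed inverse-distance weight
beta = np.zeros(p)
step, lam = 0.1, 0.05
for _ in range(3000):
    eta = np.clip(X_imp @ beta, -30, 30)
    grad = X_imp.T @ (1.0 / (1.0 + np.exp(-eta)) - y) / n
    z = beta - step * grad
    beta = np.sign(z) * np.maximum(np.abs(z) - step * lam * w, 0.0)
```

Despite 20% missingness, the discriminative feature keeps the largest coefficient while the noise features are heavily penalized.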

2.4 Evaluation metrics
In this subsection, three evaluation metrics are used to evaluate the performance of the method. These criteria are widely used in the healthcare setting [40]. They are classification accuracy (CA), sensitivity (SEN), and specificity (SPE), given as:

$$CA = \frac{TP + TN}{TP + TN + FP + FN}, \qquad SEN = \frac{TP}{TP + FN}, \qquad SPE = \frac{TN}{TN + FP},$$

where TP, FP, TN, and FN denote the true positive, false positive, true negative, and false negative counts laid out in the confusion matrix of Figure 2. The greater the values of these assessment criteria, the better the classification performance is expected to be.
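These three metrics follow directly from the confusion-matrix counts; the counts in the usage line below are illustrative only:

```python
def classification_metrics(tp, fp, tn, fn):
    """Classification accuracy, sensitivity, and specificity from
    confusion-matrix counts."""
    ca = (tp + tn) / (tp + fp + tn + fn)    # overall fraction correct
    sen = tp / (tp + fn)                    # recall on the positive class
    spe = tn / (tn + fp)                    # recall on the negative class
    return ca, sen, spe

# e.g. 40 true positives, 5 false positives, 45 true negatives, 10 false negatives
ca, sen, spe = classification_metrics(tp=40, fp=5, tn=45, fn=10)
```

Reporting sensitivity and specificity alongside accuracy matters here because the gene-expression datasets have mildly unbalanced classes.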

2.5 Dataset description
In order to assess its effectiveness, the proposed method (IAPLR) is applied to two datasets with varying numbers of genes and observations. These datasets are freely accessible and have been used by a large number of researchers in the past. The first is the colon cancer dataset, in which the number of observations is 62 people (40 malignant tumors and 22 noncancerous cells) with 6500 genes, obtained using Affymetrix oligonucleotide array technology. Only 2000 gene expressions were utilized in this dataset, selected based on the samples' lowest minimum intensity [41]. The second dataset is the bipolar disorder (Bip) dataset, which has a sample size of 61 observations, including 31 control observations and 30 bipolar disorder observations. Again, Affymetrix technology was used to capture the expression of 22,283 human genes [42], [43].

3 Results and discussion
In this section, the datasets described above were used to compare the various methods regarding feature selection with missing values. The proposed method (IAPLR) was demonstrated to be efficient through comparative experiments with Lasso and ALasso. We first applied these methods to complete data without missing values. To allow for a fair comparison, we randomly partitioned each dataset into a training dataset with 70% of the samples and a test dataset with 30% of the samples. Ten-fold cross-validation (CV) was applied to the training dataset 100 times in order to obtain the optimal value of $\lambda$, using the "glmnet" package in R. To evaluate the methods with missing values, on the other hand, the process is as follows. First, we seeded missing values in the datasets at different rates (10%, 20%, 30%) using the "missForest" package in R; this study assumes no missing data in the response variable. Second, we used the "naniar" package in R to impute the missing values. Third, we applied the penalized methods to the imputed data as complete data. The average number of selected genes and the averaged CA, SEN, and SPE in both the training and testing datasets are presented in Tables 1 and 2.
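The missingness-seeding step of this protocol can be mimicked with a few lines. This Python sketch plays the role of the R-based seeding in the pipeline (under an MCAR assumption); the matrix dimensions echo the colon dataset and are illustrative:

```python
import numpy as np

def seed_missing(X, rate, rng):
    """Return a copy of X with the given fraction of entries set to NaN
    completely at random (a stand-in for the R-based seeding step)."""
    X = X.astype(float).copy()
    X[rng.random(X.shape) < rate] = np.nan
    return X

rng = np.random.default_rng(0)
X = rng.normal(size=(62, 2000))          # a colon-data-sized matrix
missing_rates = {r: np.isnan(seed_missing(X, r, rng)).mean()
                 for r in (0.10, 0.20, 0.30)}
```

Because each entry is removed independently, the realized missing fraction concentrates tightly around the requested rate for matrices this large.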
It can be seen from the data in Tables 1 and 2 that the proposed method selected fewer genes than Lasso and ALasso in both the colon and Bip datasets at the different rates of missing values. For example, in the colon dataset with 20% missing data, IAPLR selected 13 genes, compared to 16 genes for Lasso and 15 genes for ALasso. In contrast, we observed that Lasso usually produces the highest number of selected genes in both datasets.
Furthermore, we observe in Tables 1 and 2 that, in both datasets used in this research, the average CA, SEN, and SPE of IAPLR in both the training and testing sets are much better than those of Lasso and ALasso. For instance, in the colon data with 10% missing values, the CA of IAPLR in the training set is 96%, which is better than 93.91% for Lasso and 93.82% for ALasso. Additionally, in the Bip data with 30% missing values, the SEN of IAPLR is 87.93%, which is better than that of Lasso and ALasso (81.39% and 83.86%, respectively). The same conclusion can be drawn from the testing sets in the colon and Bip datasets at the different rates of missing values.
To further highlight the performance of the IAPLR, statistical tests are required to investigate whether the differences in classification accuracy reported in Tables 1 and 2 are statistically significant. In this study, the paired t-test was utilized; Tables 3 and 4 present the findings. The column "improvement" represents the relative improvement in the mean of average accuracy that the proposed method provides over the other methods. Tables 3 and 4 demonstrate that there is a statistically significant difference between the proposed method, IAPLR, and each competing approach at the 5% level of significance.
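The paired t-test pairs the accuracies that two methods achieve on the same repeated splits. A minimal sketch of the test statistic follows; the function name and the accuracy numbers are hypothetical, chosen only to mirror the scale of the reported results:

```python
import numpy as np

def paired_t_statistic(a, b):
    """Paired t statistic for two methods' accuracies over the same
    repetitions: t = mean(d) / (sd(d) / sqrt(n)), with d = a - b."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

# Hypothetical accuracies over 5 repetitions (illustrative numbers only).
t = paired_t_statistic([0.96, 0.95, 0.97, 0.96, 0.95],
                       [0.93, 0.94, 0.93, 0.94, 0.92])
```

The statistic is then compared against a t distribution with $n - 1$ degrees of freedom; pairing on the same splits removes split-to-split variability from the comparison.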

Overall, these results indicate that the IAPLR can be effectively applied to improve gene selection, classification, and the handling of missing values in high-dimensional data. It achieved higher CA, SEN, and SPE in both the training and testing datasets. Hence, IAPLR is nominated as a potential gene selection approach, since it can simultaneously satisfy all three of these criteria. Furthermore, compared to the competitor approaches, the proposed penalized technique is the most effective classification technique. This illustrates the benefit of the IAPLR method taking the weights of the genes into account.

4 Conclusion
In data analytics, the imputation of missing data is extremely important. Unfortunately, it is difficult to find a missing data imputation method that works for all types of datasets. Although there has been significant progress in the methods and tools for variable selection, missing data often occur in large, complicated studies, which can make data analysis challenging. This study mainly focused on improving the performance of penalized logistic regression models and handling missing values in high-dimensional data through the IAPLR method. The IAPLR, Lasso, and ALasso were applied to two datasets (colon and Bip) in the presence of different rates of missing values. The findings of the comparative experiments demonstrated that the efficiency of IAPLR in the presence of missing data is better than that of the other two techniques in terms of CA, SEN, and SPE. The findings also showed that IAPLR yields statistically significant improvements in classification and gene selection.

Table 1 .
The averaged criteria over 100 repetitions for the training and testing colon dataset

Table 2 .
The averaged criteria over 100 repetitions for the training and testing Bip dataset

Table 3 .
Significance test results of the paired t-test for the training and testing colon dataset

Table 4 .
Significance test results of the paired t-test for the training and testing Bip dataset