A New Smooth Support Vector Machine with 1-Norm Penalty Term

Abstract—Recently, the soft-margin smooth support vector machine with 1-norm penalty term (SSVM1) has been shown to possess better outlier resistance than the soft-margin smooth support vector machine with 2-norm penalty term (SSVM2). One of the most important steps in the SSVM framework is to replace the plus function x_+ in the primal model by a differentiable function and then solve for an approximate solution. This study proposes a smoothing function constructed as a Padé approximant via formal orthogonal polynomials, yielding a new 1-norm SSVM, Padé SSVM1. A method for outlier filtering is also proposed to further improve outlier resistance. Experimental results show that Padé SSVM1, even without outlier filtering, outperforms the previous SSVM2 and SSVM1 on polluted synthetic datasets.

I. INTRODUCTION Support vector machines (SVMs) have proven to be among the most promising learning algorithms for classification [1]. Standard SVMs minimize a loss term plus a penalty term, each measured in the 1-norm or the 2-norm: the loss term measures the quality of the model fit, and the penalty term controls model complexity. In [2], Li-Jen Chien et al. showed that a 2-norm loss term amplifies the effect of outliers during training much more than a 1-norm loss term. From this robustness point of view, the authors of [2] developed an SSVM1 whose loss term is measured in the 1-norm, with the integral of the sigmoid function as the smoothing technique (Sigmoid SSVM1 for short). Their experiments showed that Sigmoid SSVM1 remedies the susceptibility of the 2-norm soft-margin smooth support vector machine (SSVM2) [3] to outliers and thus achieves outlier resistance.
Although SVMs have the advantage of being comparatively robust to outliers [4], there are still violent cases that mislead SVM classifiers into losing their generalization ability for prediction; even the otherwise effective Sigmoid SSVM1 becomes powerless in such cases. Li-Jen Chien, Y. J. Lee, Z. P. Kao, and C. C. Chang [2] proposed a heuristic method that filters outliers during the Newton-Armijo iterations of training, making SSVMs more robust when datasets contain extreme outliers.
In this study, we give a new smoothing technique, the Padé approximant, which approximates the plus function x_+ = max{x, 0} more accurately than the integral of the sigmoid function. The SSVM1 smoothed by this function is denoted Padé SSVM1. We will show that the outlier resistance of Padé SSVM1 is better than that of Sigmoid SSVM1 in most cases, and that it still performs well in the violent cases. We also give another outlier-filtering strategy, which turns out to be an efficient way to make SSVM2 and Sigmoid SSVM1 robust on datasets polluted with extreme outliers.
II. 1-NORM SOFT SVM (SSVM1) Consider the binary problem of classifying m points in the n-dimensional real space R^n, represented by an m × n matrix A. According to the membership of each point A_i ∈ R^n in the class +1 or −1, D is an m × m diagonal matrix with +1 or −1 along its diagonal. Similar to the framework of SSVM2 [3], the classification problem can be formulated as

    min_{w,b,ξ}  ν 1^T ξ + (1/2)(w^T w + b^2)
    s.t.  D(Aw + 1b) + ξ ≥ 1,  ξ ≥ 0,                (1)

where ν > 0 balances the loss and penalty terms. At a solution of problem (1), the slack variable ξ is given by

    ξ = (1 − D(Aw + 1b))_+.                           (2)

Thus, we can replace ξ in the constraints of (1) by (2) and convert the SVM problem (1) into an equivalent unconstrained optimization problem:

    min_{w,b}  ν 1^T (1 − D(Aw + 1b))_+ + (1/2)(w^T w + b^2).    (3)

Problem (3) is a strongly convex minimization problem without any constraint, and therefore has a unique solution. However, the objective function in (3) is not twice differentiable, which precludes the use of a fast Newton method, since such a method requires the objective function's gradient and Hessian matrix. Y. J. Lee and O. L. Mangasarian [3] introduced the integral of the sigmoid function,

    ρ(x, η) = x + (1/η) log(1 + e^(−ηx)),             (4)

with smoothing parameter η > 0, to simultaneously smooth and approximate model (3); that is, a differentiable (at least twice differentiable) function ρ replaces the plus function (·)_+ in (3) so as to obtain an approximate solution of the model. We thereby obtain the 1-norm smooth support vector machine with respect to the integral of the sigmoid function (Sigmoid SSVM1 for short):

    min_{w,b}  ν 1^T ρ(1 − D(Aw + 1b), η) + (1/2)(w^T w + b^2).   (5)

Taking advantage of the twice differentiability of the objective function of problem (5), a quadratically convergent Newton-Armijo algorithm [5] can be used to solve it. Hence, the smoothed problem can be solved without a sophisticated optimization solver.
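For concreteness, the sigmoid-integral smoothing and the resulting smoothed 1-norm objective can be sketched in a few lines of pure Python. This is a minimal illustration, not the paper's implementation; the function and variable names are ours:

```python
import math

def rho(x, eta):
    # integral of the sigmoid function: rho(x, eta) = x + (1/eta)*log(1 + exp(-eta*x)),
    # rewritten in a numerically stable form that avoids overflow for large |eta*x|
    return max(x, 0.0) + math.log1p(math.exp(-eta * abs(x))) / eta

def sigmoid_ssvm1_objective(w, b, A, d, nu, eta):
    # smoothed 1-norm soft-margin objective:
    #   nu * sum_i rho(1 - d_i * (A_i . w + b), eta) + 0.5 * (w.w + b^2)
    loss = sum(rho(1.0 - di * (sum(wj * aj for wj, aj in zip(w, ai)) + b), eta)
               for ai, di in zip(A, d))
    return nu * loss + 0.5 * (sum(wj * wj for wj in w) + b * b)
```

Because rho is twice differentiable, this objective admits the Newton-Armijo treatment discussed in the text.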
The transformation from (3) to (5) raises a natural question: are the two models equivalent? In fact, the smoothed model is no longer equal to the primal problem (3). However, in a manner analogous to [3], it is easy to prove that the solution of (5) converges to the unique solution of the primal problem as the smoothing parameter η in the SSVM1 approaches infinity. This is because, as the value of η increases, ρ(x, η) approximates the plus function more and more accurately. Therefore, how to construct an efficient smoothing technique that achieves simultaneous smoothing and approximation naturally becomes the major goal of this study.
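The convergence claim can be checked numerically: the worst-case gap between ρ(x, η) and x_+ occurs at x = 0, where it equals (log 2)/η, so the gap shrinks to zero as η grows. A small sketch (names are ours):

```python
import math

def rho(x, eta):
    # integral of the sigmoid function in numerically stable form
    return max(x, 0.0) + math.log1p(math.exp(-eta * abs(x))) / eta

def max_gap(eta):
    # worst-case gap between rho(., eta) and the plus function on a grid over [-3, 3];
    # analytically this is (log 2)/eta, attained at x = 0
    grid = [i / 100.0 for i in range(-300, 301)]
    return max(rho(x, eta) - max(x, 0.0) for x in grid)
```

Doubling η halves the gap, which is the mechanism behind the convergence of (5) to (3).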

III. 1-NORM SMOOTH SUPPORT VECTOR MACHINE
BASED ON PADÉ APPROXIMANT In this section, we propose a rational function, namely the Padé approximant, as the smoothing technique to simultaneously smooth and approximate the plus function within the framework of SSVM1.

A. Padé Approximation via the FOP
Let f(x) = Σ_{i≥0} c_i x^i be a given power series with coefficients c_i ∈ C. The [m/n] Padé approximant of f is a rational function P_m(x)/Q_n(x), with deg P_m ≤ m and deg Q_n ≤ n, satisfying

    f(x) − P_m(x)/Q_n(x) = O(x^(m+n+1)),

where the right-hand side denotes a power series in x whose lowest-order term has degree m+n+1 or higher. Let c^(h): P → C be a linear functional on the polynomial space P, defined on the monomial basis by

    c^(h)(x^i) = c_{h+i},  i = 0, 1, 2, …,

with the convention that c_i = 0 for i < 0. The formal orthogonal polynomials (FOPs) associated with c^(m−n+1) are the polynomials Q_n of degree n satisfying the orthogonality conditions [7]

    c^(h)(x^i Q_n(x)) = 0,  i = 0, 1, …, n − 1,

with h = m − n + 1.
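As a concrete illustration, an [m/n] Padé approximant can be computed from the series coefficients by solving the standard linear system for the denominator; the FOP route in the text produces the same denominator polynomial. This sketch (names ours, degenerate systems not handled) uses exact rationals:

```python
from fractions import Fraction

def pade(c, m, n):
    """[m/n] Pade approximant of the series sum_i c[i] x^i.

    Returns (p, q): numerator and denominator coefficient lists with q[0] = 1,
    found by solving sum_{j=1..n} q[j] * c[m+k-j] = -c[m+k] for k = 1..n.
    """
    c = [Fraction(x) for x in c] + [Fraction(0)] * (m + n + 1)
    A = [[c[m + k - j] if m + k - j >= 0 else Fraction(0) for j in range(1, n + 1)]
         for k in range(1, n + 1)]
    rhs = [-c[m + k] for k in range(1, n + 1)]
    # Gaussian elimination with partial pivoting over exact rationals
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        rhs[col], rhs[piv] = rhs[piv], rhs[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for cc in range(col, n):
                A[r][cc] -= f * A[col][cc]
            rhs[r] -= f * rhs[col]
    q = [Fraction(1)] + [Fraction(0)] * n
    for col in range(n - 1, -1, -1):  # back substitution
        s = rhs[col] - sum(A[col][cc] * q[cc + 1] for cc in range(col + 1, n))
        q[col + 1] = s / A[col][col]
    # numerator from the matched low-order terms: p[i] = sum_j q[j] * c[i-j]
    p = [sum(q[j] * c[i - j] for j in range(0, min(i, n) + 1)) for i in range(m + 1)]
    return p, q
```

For example, the series 1 + x + x²/2 (the start of e^x) has [1/1] Padé approximant (1 + x/2)/(1 − x/2).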
We now present the main theorem about Padé approximation via the formal orthogonal polynomials (PAVOP); for its proof, see [8].
B. Padé Approximant for x_+ We now consider using a Padé approximant to simultaneously smooth and approximate the plus function x_+.
It is well known that the plus function is not smooth, but it is continuous, so it can be expanded into a power series, and a Padé approximant P(x, η) of this power series is computed by Theorem 3.3, giving (20). The Padé SSVM1 model is then

    min_{w,b}  ν 1^T P(1 − D(Aw + 1b), η) + (1/2)(w^T w + b^2),   (21)

where 1 denotes a column vector of ones of the appropriate dimension, the function P acts componentwise on a matrix or vector in (21), i.e., P(1 − D(Aw + 1b), η) ∈ R^m with (P(1 − D(Aw + 1b), η))_i = P(1 − D_i(A_i w + b), η), and η, whose value is not a decisive factor for the final SSVM1, is called the smoothing parameter. We now give a simple theorem that bounds the difference between the plus function x_+ and its smooth approximant P(x, η). Theorem 3.4. Let x ∈ R, let P(x, η) be defined as in (20), and let x_+ be the plus function. Then (i) P(x, η) is quadratically smooth, and at the points x = ±1/η and x = 0 it satisfies the corresponding smoothness and approximation conditions. The Newton-Armijo algorithm for SSVM1 is omitted here because it runs the same procedure as in the 2-norm problem.
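The rational formula (20) is not reproduced here, so as an illustration only, the following is a hypothetical piecewise-quadratic smoothing with the same knots x = ±1/η that appear in Theorem 3.4; it is a stand-in for intuition, not the paper's Padé approximant:

```python
def p_smooth(x, eta):
    # piecewise-quadratic smoothing of the plus function with knots at x = +/- 1/eta;
    # it equals x_+ outside [-1/eta, 1/eta], is C^1 everywhere,
    # and its maximum gap from x_+ is 1/(4*eta), attained at x = 0
    if x >= 1.0 / eta:
        return x
    if x <= -1.0 / eta:
        return 0.0
    return eta * (x + 1.0 / eta) ** 2 / 4.0
```

As with the sigmoid integral, increasing η tightens the approximation, here at rate 1/(4η) rather than (log 2)/η.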

IV. NUMERICAL RESULTS AND A METHOD FOR
OUTLIER FILTERING As stated in [9], Sigmoid SSVM1 possesses good outlier resistance, which can be observed in numerical tests. The first result is presented in Fig. 2, and the corresponding comparison of correctness is given in Table I.
As Li-Jen Chien has already pointed out, there are violent cases that can easily mislead either Sigmoid SSVM1 or Sigmoid SSVM2 into losing its generalization ability. One such case is presented in Fig. 3, similar to Fig. 1 in [2], in which the positive and negative samples follow normal distributions with means 2 and −2, respectively, and standard deviation 1; the outliers differ from their group means by 75, i.e., 75 times the standard deviation, and the outlier ratio over the positive and negative samples combined is 0.025. In this case, both Sigmoid SSVM2 and Sigmoid SSVM1 lose efficacy. The reason all of these SVMs (Sigmoid SSVM1, Sigmoid SSVM2, and even LIBSVM [10]) lose their generalization ability here is that, because of the extreme outliers, they spend too much effort minimizing the loss term and sacrifice the minimization of the penalty term [2]. Fortunately, Padé SSVM1 remains robust and retains its generalization ability even in this violent case; indeed, the theoretical analysis and the numerical results show that the new SSVM1 constructed from this Padé approximant, i.e., Padé SSVM1, possesses the best outlier resistance among the SSVMs compared.
To eliminate the influence of outliers in such violent cases, Li-Jen Chien, Y. J. Lee, Z. P. Kao, and C. C. Chang [2] prescribed a heuristic method to filter out the extreme outliers. In this study we give a slightly different strategy: we first run the SSVM1 training process and then ignore the samples with large slacks ξ_i. But how large must ξ_i be? We take the outlier ratio as our threshold: samples whose ξ_i values lie above the 90th percentile are discarded, until the discarded fraction reaches the outlier ratio, and the remaining samples are then used to train a new SSVM1 as the final classifier. We denote this outlier-filtering method by SSVM1−o.
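The filtering step can be sketched as follows. This is a simplified illustration with names of our choosing: it keeps only the outlier-ratio cutoff and drops the largest-slack samples up to that fraction, omitting the 90th-percentile rule:

```python
def filter_outliers(xi, outlier_ratio):
    """Return sorted indices of samples kept by the slack-based filtering step.

    xi            : slack values xi_i = max(0, 1 - d_i * f(A_i)) from a first SSVM1 run
    outlier_ratio : assumed fraction of outliers; at most this fraction is discarded
    """
    m = len(xi)
    drop = int(m * outlier_ratio)                  # number of samples to discard
    order = sorted(range(m), key=lambda i: xi[i])  # indices in ascending slack order
    return sorted(order[:m - drop])                # keep the smallest-slack samples
```

The final classifier is then obtained by retraining an SSVM1 on the kept samples only.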
To strengthen the robustness of SSVMs in such violent cases, a simple outlier-filtering method has been proposed; this method also considerably improves the robustness of Sigmoid SSVM1 and Sigmoid SSVM2.