A Fuzzy Least Squares Support Tensor Machine in Machine Learning

In machine learning, high-dimensional data are often encountered in real applications. Most traditional learning algorithms, such as SVM, are based on the vector space model. Tensor representation helps alleviate the overfitting problem common in vector-based learning, and tensor-based algorithms require a smaller set of decision variables than vector-based approaches. We also require that the meaningful training points be classified correctly, while not caring whether noise-like training points are classified correctly. To utilize the structural information present in the high-dimensional features of an object, we propose a tensor-based learning framework, termed the Fuzzy Least Squares Support Tensor Machine (FLSSTM), in which the classifier is obtained by solving a system of linear equations at each iteration, rather than the quadratic programming problem solved at each iteration of the STM algorithm. This provides a significant reduction in computation time while maintaining comparable classification accuracy. The efficacy of the proposed method is demonstrated on the ORL and Yale databases. FLSSTM outperforms other tensor-based algorithms, such as LSSTM, especially when the training size is small.


I. INTRODUCTION
With the development of society, a tremendous amount of data continuously floods in, and this explosive growth has generated an urgent need for new techniques and skills in data mining. Machine learning is an important branch of data mining, and in machine learning the representation of data is one of the core tasks. High-dimensional data are often encountered in real applications, so how to efficiently represent image data has a fundamental effect on classification. Most traditional learning algorithms, such as SVM and LSSVM, are based on the vector space model [1,2].
However, in practice, images are intrinsically matrices (second-order tensors), and many objects are naturally expressed as tensors. For example, a gray image and a gray image sequence can be represented by a second-order tensor and a third-order tensor, respectively (see Fig. 1 and Fig. 2). To represent images appropriately, one must note that vectorizing matrix patterns (second-order tensors) before classification has the following drawbacks: (1) it destroys the structural information of the data, (2) it leads to high-dimensional vectors, and (3) it invites overfitting. In other words, some implicit structural or local contextual information may be lost in the transformation. Moreover, the higher the dimension of a vector pattern, the more space is needed to store it. In recent years, because of these three drawbacks, algorithms based on tensor space have attracted significant interest from the research community. Several algorithms have been extended to deal with tensors, such as the support tensor machine (STM) [3][4][5][6][7][8][9], multi-linear principal component analysis (MPCA) [10], multi-linear discriminant analysis (MDA) [11], canonical correlation analysis of tensors (CAC) [12], and nonnegative tensor factorization (NTF) [13]. With the tensor representation, the number of parameters estimated by tensor-based learning can be greatly reduced. Therefore, tensor-based learning algorithms are especially suitable for the small-sample-size (S3) problem, where the number of samples available for training is small and the number of input features used to represent the data is large. At the same time, the reduced number of parameters also lowers the computational complexity of high-dimensional problems. The least squares support vector machine (LSSVM) was proposed by Suykens and Vandewalle [14]; it not only replaces the inequality with the equality
in the defining constraint structure of SVM, but also replaces the absolute error measure with a squared error measure in the objective function. Compared with SVM, LSSVM requires the solution of a system of linear equations rather than a quadratic programming problem, which makes it extremely fast relative to traditional SVM. Owing to its generalization ability and low computational cost, LSSVM has become a powerful tool for solving binary classification problems in machine learning. However, in many practical engineering applications, the training data are often corrupted by abnormal outliers and noise, and some sample points are misplaced on the wrong side by accident. In this case, the traditional SVM may fail to classify the contaminated data correctly, because, due to overfitting, SVM is particularly sensitive to outliers. The fuzzy support vector machine (FSVM) [15][16][17] is an effective method for dealing with this problem. In FSVM, each training sample is associated with a fuzzy membership, and different memberships represent different contributions to the learning of the decision surface; this reduces the effect of outliers. Samples with a higher membership value can be thought of as more representative of their class, while those with a lower membership value are given less importance. Abe and Inoue extended FSVM from binary classification to the multi-class problem and applied it to multi-class text categorization [18].
In this paper, we propose a novel method called the fuzzy least squares support tensor machine (FLSSTM), which is a tensor version of LSSVM, or equivalently a fuzzy version of LSSTM. LSSTM is based on the tensor space and directly accepts order-2 tensors as inputs, without vectorization. Obtaining a classifier in the tensor space not only retains the structural information of the data, but also helps overcome the overfitting problem often encountered in vector-based learning. In contrast to solving a QPP at every iteration of the STM algorithm, FLSSTM solves a system of linear equations in an iterative fashion, which converges to an optimal solution after a few iterations. Its applications include problems involving high-dimensional inputs, e.g., image classification and text categorization.
The rest of the paper is organized as follows. Section II provides an overview of LSSVM and fuzzy membership, the necessary background for the FLSSTM proposed in Section III. Section IV presents the experimental results, and Section V concludes.

II. LSSVM AND FUZZY MEMBERSHIP
In this section, we briefly introduce LSSVM and fuzzy membership.

A. LSSVM
Given a set of training samples $\{(x_i, y_i)\}_{i=1}^{l}$, where $x_i \in \mathbb{R}^n$ and $y_i \in \{+1, -1\}$, the classification problem is modeled by the following program [14]:

$$\min_{w,b,\xi} \ \frac{1}{2}\|w\|^2 + \frac{C}{2}\sum_{i=1}^{l}\xi_i^2 \quad \text{s.t.} \quad y_i\left(w^{T}\phi(x_i) + b\right) = 1 - \xi_i, \quad i = 1,\ldots,l,$$

where $w$ is the normal vector, $C$ is a regularization parameter, and $\phi(x)$ is a nonlinear function mapping $x_i$ to a high-dimensional feature space. By introducing Lagrange multipliers $\alpha$ and applying the KKT conditions, we obtain $\alpha$ through the following linear equations:

$$\begin{bmatrix} 0 & y^{T} \\ y & \Omega + C^{-1}I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ e \end{bmatrix},$$

where $\Omega_{ij} = y_i y_j\,\phi(x_i)^{T}\phi(x_j)$, $I$ is an identity matrix of appropriate dimension, and $e$ is a vector of ones of appropriate dimension.
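As an illustration, the bordered linear system above can be assembled and solved directly with a general-purpose linear algebra routine. The sketch below assumes a linear kernel; the function names are ours, not from [14].

```python
import numpy as np

def lssvm_train(X, y, C):
    """Solve the LSSVM KKT linear system for a linear kernel.

    X: (l, n) training matrix, y: (l,) labels in {-1, +1}, C > 0.
    Returns the Lagrange multipliers alpha and the bias b."""
    l = X.shape[0]
    K = X @ X.T                                # linear-kernel Gram matrix
    Omega = (y[:, None] * y[None, :]) * K      # Omega_ij = y_i y_j K(x_i, x_j)
    A = np.zeros((l + 1, l + 1))
    A[0, 1:] = y                               # top border: y^T
    A[1:, 0] = y                               # left border: y
    A[1:, 1:] = Omega + np.eye(l) / C          # Omega + C^{-1} I
    rhs = np.concatenate(([0.0], np.ones(l)))  # right-hand side [0; e]
    sol = np.linalg.solve(A, rhs)
    return sol[1:], sol[0]                     # alpha, b

def lssvm_predict(X_train, y_train, alpha, b, X_new):
    # decision function: sign(sum_i alpha_i y_i K(x_i, x) + b)
    return np.sign((X_new @ X_train.T) @ (alpha * y_train) + b)
```

Since $\Omega + C^{-1}I$ is positive definite, the bordered system has a unique solution, which is why no quadratic programming solver is needed.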

B. Fuzzy Membership
In many real-world applications, due to overfitting in SVMs, the training process is particularly sensitive to abnormal outliers in the training dataset that lie far away from their own class. A key difficulty with real datasets is that some of these abnormal outliers are noise, which tends to corrupt the samples. To decrease the effect of such outliers or noise, we assign each data point in the training dataset a membership. Samples with a higher membership value can be thought of as more representative of their class, while those with a lower membership value are given less importance, so that abnormal data with low membership contribute less to the total error term.
In fact, the fuzzy membership value determines how important it is to classify a data sample correctly. Each data point is given a fuzzy membership, and different memberships make different contributions to the learning algorithm and model, so choosing appropriate fuzzy memberships is one of the important issues in machine learning. The distance between a sample and its class center is commonly used as the basis for measuring the importance of the sample; at present, most authors define the fuzzy membership based on this distance [19][20].
Recently, fuzzy SVMs have attracted a lot of interest and have been shown to be extremely successful, but few papers discuss fuzzy membership in the tensor learning setting. In the STM classification problem, some training points are often more important than others. With tensor training data, we likewise require that the meaningful training points be classified correctly, without caring whether noise-like training points are classified correctly.
When dealing with binary classification problems by STM, all training tensors are presumed to belong entirely to either the positive class or the negative class; in other words, they are assumed to have equal weight or relevance. In fact, however, a training point may not belong exactly to one of the two classes: it may belong 90% to the positive class and 10% to the negative class. This can be modeled by associating a fuzzy membership $0 < s_i \le 1$ with each training point $X_i$. Each data point is given a fuzzy membership, and different memberships make different contributions to the learning algorithm, so choosing appropriate memberships is very important. As above, the distance between a sample and its class center is used to measure the importance of the sample. We are thus given a set of training samples $\{(X_i, y_i, s_i)\}_{i=1}^{l}$, where $s_i$ is the fuzzy membership degree of $X_i$ belonging to class $y_i$.
Denote the mean of the positive class by $X_+$ and the mean of the negative class by $X_-$. Let the radii of the positive and negative classes be

$$r_+ = \max_{y_i = +1} \|X_i - X_+\|, \qquad r_- = \max_{y_i = -1} \|X_i - X_-\|.$$

The fuzzy membership $s_i$ is then

$$s_i = \begin{cases} 1 - \dfrac{\|X_i - X_+\|}{r_+ + \delta}, & y_i = +1, \\[2mm] 1 - \dfrac{\|X_i - X_-\|}{r_- + \delta}, & y_i = -1, \end{cases}$$

where $\delta > 0$ is used to avoid the case $s_i = 0$.
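A minimal sketch of this distance-to-class-center membership assignment for second-order tensors (the function name is ours, and the Frobenius norm is assumed as the distance measure):

```python
import numpy as np

def fuzzy_memberships(tensors, labels, delta=1e-3):
    """Assign each sample a membership s_i = 1 - d_i / (r_c + delta).

    tensors: (l, n1, n2) array of second-order tensors,
    labels: (l,) array with entries in {-1, +1},
    delta > 0 keeps every membership strictly positive."""
    s = np.empty(len(labels), dtype=float)
    for c in (+1, -1):
        idx = labels == c
        centre = tensors[idx].mean(axis=0)             # class mean X_+ or X_-
        # Frobenius distance of each sample to its class centre
        dists = np.linalg.norm(tensors[idx] - centre, axis=(1, 2))
        radius = dists.max()                           # class radius r_c
        s[idx] = 1.0 - dists / (radius + delta)
    return s
```

Samples near the class center get memberships close to 1, while the farthest (outlier-like) sample gets the small but positive value $\delta/(r_c+\delta)$.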

III. FUZZY LEAST SQUARE SUPPORT TENSOR MACHINES
As is well known, tensor data are also usually contaminated by noise caused by disturbance and measurement, and tensor data near the class boundary are the most affected. In a spirit similar to LSSTM, and noting that the standard STM for pattern classification is not robust to noise, we develop a fuzzy model of LSSTM in the tensor space, called the fuzzy least squares support tensor machine (FLSSTM).

A. Reformulate LSSTM
Given a set of training samples for classification, $\{(X_i, y_i, s_i)\}_{i=1}^{l}$, where $X_i \in \mathbb{R}^{n_1 \times n_2}$, $y_i \in \{+1, -1\}$, and $s_i$ is the fuzzy membership degree of $X_i$ belonging to class $y_i$, the classification problem is modeled by the following program:

$$\min_{u,v,b,\xi} \ \frac{1}{2}\|uv^{T}\|^2 + \frac{C}{2}\sum_{i=1}^{l} s_i \xi_i^2 \quad \text{s.t.} \quad y_i\left(u^{T} X_i v + b\right) = 1 - \xi_i, \quad i = 1,\ldots,l, \tag{6}$$

where $C > 0$ is a regularization parameter, $\xi_i$ is a slack variable, and $s_i$ is the membership generated by some outlier-detecting method. It should be emphasized that the weight tensor $uv^{T}$ is a rank-one matrix. To solve the optimization problem (6), we first introduce Lagrange multipliers $\alpha_i$ $(i = 1, \ldots, l)$ and construct the Lagrangian function

$$L(u, v, b, \xi, \alpha) = \frac{1}{2}\|uv^{T}\|^2 + \frac{C}{2}\sum_{i=1}^{l} s_i \xi_i^2 - \sum_{i=1}^{l} \alpha_i \left[ y_i\left(u^{T} X_i v + b\right) - 1 + \xi_i \right]. \tag{7}$$

Since $\|uv^{T}\|^2 = \|u\|^2 \|v\|^2$, the Lagrangian (7) can be rewritten in terms of $\|u\|^2\|v\|^2$. The KKT necessary conditions for optimality are

$$\frac{\partial L}{\partial u} = \|v\|^2 u - \sum_{i=1}^{l} \alpha_i y_i X_i v = 0, \tag{9}$$

$$\frac{\partial L}{\partial v} = \|u\|^2 v - \sum_{i=1}^{l} \alpha_i y_i X_i^{T} u = 0, \tag{10}$$

$$\frac{\partial L}{\partial b} = -\sum_{i=1}^{l} \alpha_i y_i = 0, \qquad \frac{\partial L}{\partial \xi_i} = C s_i \xi_i - \alpha_i = 0, \quad i = 1, \ldots, l.$$
From equations (9) and (10), it is obvious that $u$ and $v$ depend on each other and cannot be solved for directly. As in STM, we use an alternating iterative algorithm [3,4].
We first fix $u$. Let $x_i = X_i^{T} u$ and $\beta = \|u\|^2$. Problem (6) then reduces to

$$\min_{v,b,\xi} \ \frac{\beta}{2}\|v\|^2 + \frac{C}{2}\sum_{i=1}^{l} s_i \xi_i^2 \quad \text{s.t.} \quad y_i\left(v^{T} x_i + b\right) = 1 - \xi_i, \quad i = 1,\ldots,l. \tag{13}$$

It can be seen that (13) is similar in structure to LSSVM. To solve (13), we consider its Lagrangian and apply the KKT necessary and sufficient optimality conditions by setting the derivatives with respect to $v$, $b$, and $\xi$ equal to zero, which yields

$$v = \frac{1}{\beta}\sum_{i=1}^{l} \alpha_i y_i x_i, \qquad \sum_{i=1}^{l} \alpha_i y_i = 0, \qquad \alpha_i = C s_i \xi_i. \tag{15--17}$$

Substituting the values of $v$, $b$, and $\xi$ from (15)-(17) into the equality constraints of (13) yields the following system of linear equations for obtaining $\alpha$:

$$\begin{bmatrix} 0 & y^{T} \\ y & \Omega + C^{-1} S^{-1} \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ e \end{bmatrix}, \tag{18}$$

where $\Omega_{ij} = \beta^{-1} y_i y_j x_i^{T} x_j$, $S^{-1}$ is the diagonal matrix with entries $1/s_i$, and $e$ is a vector of ones of appropriate dimension; $v$, $b$, and $\xi$ can then be recovered using (15)-(17).
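The inner weighted step can be sketched as follows. The function name and variable names are ours; for simplicity the $\|u\|^2$ scaling is absorbed into $C$ (this changes only the effective regularization), and the memberships enter as the diagonal term $1/(C s_i)$.

```python
import numpy as np

def flsstm_inner_solve(Z, y, s, C):
    """One FLSSTM half-step: a membership-weighted LSSVM on projections.

    Z: (l, m) projected samples (z_i = X_i^T u, or X_i v for the other step),
    y: (l,) labels in {-1, +1}, s: (l,) memberships in (0, 1], C > 0.
    The diagonal term 1/(C s_i) makes the system non-singular and
    down-weights low-membership (outlier-like) samples."""
    l = Z.shape[0]
    Omega = (y[:, None] * y[None, :]) * (Z @ Z.T)
    A = np.zeros((l + 1, l + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.diag(1.0 / (C * s))
    sol = np.linalg.solve(A, np.concatenate(([0.0], np.ones(l))))
    b, alpha = sol[0], sol[1:]
    w = Z.T @ (alpha * y)   # recovered weight vector (v or u)
    return w, b, alpha
```

By construction the solution satisfies the equality constraints $y_i(w^{T}z_i + b) = 1 - \xi_i$ with $\xi_i = \alpha_i/(C s_i)$, which is a convenient correctness check.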

We observe that solving (13) requires the inversion of an $l \times l$ matrix in (18), which is non-singular due to the diagonal perturbation introduced by the term $C^{-1}S^{-1}$; this inversion accounts for most of the computational time.
Once $v$ is obtained, let $x_i = X_i v$. Along similar lines, $u$ can be computed from the following problem:

$$\min_{u,b,\xi} \ \frac{\|v\|^2}{2}\|u\|^2 + \frac{C}{2}\sum_{i=1}^{l} s_i \xi_i^2 \quad \text{s.t.} \quad y_i\left(u^{T} x_i + b\right) = 1 - \xi_i, \quad i = 1,\ldots,l, \tag{19}$$

which is solved in the same fashion as (13). Thus, $u$ and $v$ can be obtained by iteratively solving the optimization problems (13) and (19).
Algorithmic Procedure:
Step 1. Initialization: set $u$ and $v$ to vectors of ones of appropriate dimensions.
Step 2. Computing $v$: fix $u$, let $x_i = X_i^{T} u$, and compute $v$ by solving problem (13).
Step 3. Computing $u$: fix the $v$ obtained in Step 2, let $x_i = X_i v$, and compute $u$ by solving problem (19).
Step 4. Iteration: repeat Steps 2 and 3 until $u$ and $v$ converge.
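The whole alternating procedure can be sketched end-to-end as below. This is our illustrative implementation, not the authors' code: the norm factors $\|u\|^2$, $\|v\|^2$ are absorbed into $C$ for simplicity, and a relative-change test on $u$ serves as the convergence criterion.

```python
import numpy as np

def flsstm_fit(tensors, y, s, C=1.0, max_iter=20, tol=1e-4):
    """Alternating least-squares FLSSTM sketch for order-2 tensors.

    tensors: (l, n1, n2), y: (l,) labels in {-1, +1},
    s: (l,) fuzzy memberships in (0, 1]. Alternates two weighted LSSVM
    linear systems, one for v (u fixed) and one for u (v fixed),
    until the rank-one weight u v^T stabilises."""
    l, n1, n2 = tensors.shape

    def half_step(Z):
        # membership-weighted LSSVM system; 1/(C s_i) on the diagonal
        Omega = (y[:, None] * y[None, :]) * (Z @ Z.T)
        A = np.zeros((l + 1, l + 1))
        A[0, 1:] = y
        A[1:, 0] = y
        A[1:, 1:] = Omega + np.diag(1.0 / (C * s))
        sol = np.linalg.solve(A, np.concatenate(([0.0], np.ones(l))))
        b, alpha = sol[0], sol[1:]
        return Z.T @ (alpha * y), b          # weight vector and bias

    u = np.ones(n1)                          # Step 1: initialise u
    for _ in range(max_iter):
        # Step 2: z_i = X_i^T u, solve for v
        v, b = half_step(np.einsum('ipq,p->iq', tensors, u))
        # Step 3: z_i = X_i v, solve for u
        u_new, b = half_step(np.einsum('ipq,q->ip', tensors, v))
        # Step 4: stop when u stabilises
        if np.linalg.norm(u_new - u) < tol * (np.linalg.norm(u) + 1e-12):
            u = u_new
            break
        u = u_new
    return u, v, b

def flsstm_predict(tensors, u, v, b):
    # decision function: sign(u^T X_i v + b)
    return np.sign(np.einsum('ipq,p,q->i', tensors, u, v) + b)
```

On toy rank-one tensor data with the two classes centred at $\pm a$, a few iterations suffice for the classifier to separate the training set.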

IV. EXPERIMENTAL RESULTS
In this section, to verify the effectiveness of the proposed FLSSTM, we compare its results with those of the vector-based classification method FLSSVM and the tensor-based classification method LSSTM. All algorithms were implemented in MATLAB 7.8 (R2009a) on Ubuntu, running on a PC with an Intel Core5 Duo (2.60 GHz) CPU and 2 GB of RAM. The datasets used are the ORL and Yale face databases described below.

A. Experiments Preparation
For each size of the training set, ten independent runs are performed and their classification accuracies on the test sets are averaged; the training images are randomly selected from the entire set in each run. The parameter $C$ is obtained by cross-validation, with the regularization constant ranging from $2^{-6}$ to $2^{6}$, multiplying by 2 at each step. All experiments focus on second-order tensors, namely images in matrix form. We initialize both $u$ and $v$ as vectors of ones of appropriate dimensions. The kernel function used in both LSSTM and FLSSTM is the linear kernel, i.e., $K(x_i, x_j) = x_i^{T} x_j$. In our experiments, we use the two databases summarized in Table I.
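The search grid for $C$ described above can be generated as follows (a trivial sketch; the variable name is ours):

```python
# Regularization grid used for cross-validation:
# C in {2^-6, 2^-5, ..., 2^5, 2^6}, multiplying by 2 at each step.
C_grid = [2.0 ** k for k in range(-6, 7)]
```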

B. Experiments on the ORL database
The ORL database [21] of face images was provided by the AT&T Laboratories, Cambridge. It contains 400 images of 40 individuals, with varying lighting, facial expression (open or closed eyes, smiling or not smiling), and facial details (glasses or no glasses). All images were normalized to a resolution of 32×32 pixels (1024 pixels in total). Since we are interested in testing the effectiveness of tensor-based algorithms when the dimension of the data is large and the available training set is small, we do not perform cropping or resizing of the images, which would reduce the number of features in the data; histogram equalization was applied in the preprocessing step. The effect of histogram equalization on a particular example image is displayed in Fig. 3.

Since we consider the binary problem of learning a classifier, we randomly choose two classes of images to distinguish. For each binary classification experiment, we consider a subset of the ten images of both subjects for training (training ratio of 10%, 20%, 30%, 40%, 50%), while the rest is used for testing. Table II shows the mean recognition rates and standard deviations of all algorithms with different ratios of training and test sets on the ORL database. From Table II, it can be seen that when the training ratio is small, FLSSTM outperforms FLSSVM and LSSTM, and the advantage of FLSSTM gradually shrinks as the training set grows. When the training ratio is 10%, the maximum difference in accuracy between FLSSTM and FLSSVM is 7.78%, whereas the maximum difference between FLSSTM and LSSTM is only 1.67%. The superiority of the tensor-based algorithms gradually diminishes as the training set grows: the maximum difference in accuracy between FLSSTM and FLSSVM is only 1.56% when the training ratio is 50%.

C. Experiments on the Yale database
The Yale database [22] of face images was provided by Yale University. It contains 165 images of 15 individuals, with 11 images per person, varying in lighting condition (left-light, center-light) and facial expression (happy, sad, normal, sleepy, surprised, wink). All images were normalized to a resolution of 100×100 pixels (10000 pixels in total). One subject from the Yale database is displayed in Fig. 4. We consider binary classification examples consisting of subject pairs with similar facial features (smile, beard, glasses) and subject pairs with distinct facial features; a list of the selected subject pairs is given in Table III. For each binary classification experiment, we consider a subset of the eleven images of both subjects for training (training ratio of 10%, 20%, 30%), while the rest is used for testing.
The efficacy of FLSSTM is compared with FLSSVM and LSSTM on the Yale database. Tables IV and V show the mean recognition rates and standard deviations of all algorithms with different training-set sizes. Table IV gives the percentage-accuracy comparisons for binary classification, and it is evident that FLSSTM outperforms LSSTM in most cases: the maximum difference in accuracy between FLSSTM and LSSTM is 6.5%, while the minimum difference is 3.23% over the different subject pairs when the training ratio is 10%. Compared with FLSSVM, the advantage of FLSSTM is also evident: the maximum difference in accuracy between FLSSTM and FLSSVM is 8.11%, while the minimum difference is 0.83% at a 10% training ratio. The superiority of FLSSTM diminishes when the training ratio reaches 30%. In fact, FLSSTM outperforms LSSTM and FLSSVM especially when the two subjects appear quite dissimilar and the training size is small.
From Tables IV and V, it can be seen clearly that the introduction of fuzzy membership improves classification ability; the improvement appears at small training sizes rather than large ones. The main reason is that there is less information for determining the classifying planes when the training size is small, so the fuzzy membership can increase test accuracy when building the classifier; its advantage gradually diminishes as the training set grows. We also ran experiments on computing time with the different databases. The running-time comparisons between STM, LSSTM, and FLSSTM, for learning a single binary classifier, are shown in Fig. 5 and Fig. 6. From these figures it can be seen that the running time of FLSSTM is significantly less than that of STM and almost the same as that of LSSTM on both databases. The results demonstrate that FLSSTM provides a significant reduction in computational time with comparable classification accuracy, mainly because FLSSTM and LSSTM require solving a series of linear systems rather than the quadratic programming problems of the STM algorithm.

V. CONCLUSIONS AND FUTURE WORK
In this paper, we first consider fuzzy membership of the training data in tensor space and propose an improved tensor-based method, the FLSSTM algorithm, to learn better from databases in the presence of outliers or noise. For small-sample-size (S3) problems, the tensor representation consistently performs better than the vector representation, because the number of parameters estimated by STM is much smaller than that estimated by standard SVM; similar results hold for the FLSSTM algorithm. The numerical experiments above show that tensor-based methods have advantages over vector-based methods for S3 problems.
The formulation of FLSSTM requires solving a system of linear equations at every step of an iterative algorithm, using an alternating projection method similar to STM, in contrast to solving a QPP; this makes FLSSTM a fast tensor-based linear classifier. Choosing a proper fuzzy membership function is quite important for solving classification problems with FLSSTM, so in the future we will continue to research methods for selecting better fuzzy membership functions. We will also investigate how to apply the proposed algorithm to large-scale classification problems and how to use matricization to improve classification accuracy.

Figure 1 .
Figure 1. A gray image can be represented by a matrix.

Accuracy and running time are used to estimate the performance of each algorithm. The accuracy is defined as

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FN + FP},$$

where $TP$, $TN$, $FN$, and $FP$ represent the number of positive data points that are correctly classified, the number of negative data points that are correctly classified, the number of positive data points that are falsely classified, and the number of negative data points that are falsely classified, respectively.
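For completeness, the accuracy measure can be computed as follows (a small sketch with our function name, assuming labels in $\{-1, +1\}$):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Accuracy = (TP + TN) / (TP + TN + FN + FP) for labels in {-1, +1}."""
    tp = np.sum((y_true == 1) & (y_pred == 1))    # positives correctly classified
    tn = np.sum((y_true == -1) & (y_pred == -1))  # negatives correctly classified
    fn = np.sum((y_true == 1) & (y_pred == -1))   # positives falsely classified
    fp = np.sum((y_true == -1) & (y_pred == 1))   # negatives falsely classified
    return (tp + tn) / (tp + tn + fn + fp)
```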

Figure 3 .
Figure 3. The effect of applying histogram equalization on a sample image.

Figure 4 .
Figure 4. Eleven facial samples from a subject within the Yale database.

Figure 5 .
Figure 5.The computing time of STM, LSSTM, FLSSTM on ORL data

Figure 6 .
Figure 6.The computing time of LSSTM, FLSSTM on different data

TABLE IV .
MEAN RECOGNITION RATES (%) AND STANDARD DEVIATIONS ON YALE DATABASE