A New Approach for Remote Sensing Image Sample Selection Based on Convex Theory

Abstract—Advancements in remote sensing technology have led to improvements in the acquisition of land cover information. The extraction of accurate and timely knowledge about land cover from remote sensing imagery largely depends on the classification techniques used. The support vector machine has been receiving considerable attention as a promising method for classifying remote sensing imagery. However, the support vector machine learning process typically requires a large amount of memory and significant computation time when handling a large sample set, in which some of the samples might be redundant and useless for training the support vector machine model. Therefore, fewer, higher-quality samples obtained through sample selection should be used for support vector machine-based remote sensing classification. A convex theory-based remote sensing sample selection algorithm for support vector machine classifiers is developed in this work. A Landsat-5 Thematic Mapper image acquired on August 31, 2009 (orbit number 113/27) is adopted in our experiments. The study area's land cover/use was divided into five categories. Using the region of interest tool, we selected samples from the image of the study area, with each category consisting of 1000 independent pixels. Results show that in most cases, our method achieves higher classification accuracy than the random sample selection method.


I. INTRODUCTION
Land cover information has been identified as one of the crucial data components for many aspects of global change studies and environmental applications. Remote sensing technology can help obtain land cover information in an easy and timely manner. The remote sensing image classification process is illustrated in Figure 1: the remote sensing satellite collects earth surface images and transmits them to the data center; by specifying paths and rows, users can download images of a specified location; a classification algorithm then analyzes samples of selected pixels and produces a remote sensing image classification result. Recently, the Support Vector Machine (SVM) has received increasing attention in the study of remote sensing classification [1,2]. One limitation of SVM, however, is that its training stage takes up large amounts of memory and significant computation time, especially in the case of large sample sizes, where some samples may not even be useful for training [3]. Hence, sample selection (i.e., selecting the most important samples) plays an important role. Considerable work on sample selection has been done, based for example on clustering methods [4,5], the Mahalanobis distance [6], the β-skeleton and the Hausdorff distance [7,8], and information theory [9,10]. Although much research progress has been achieved, problems remain. For a given sample set in a particular application, the majority of existing studies have focused mainly on accelerating training by minimizing the size of the training sample set; no study has considered the selection of samples with a user-specified percentage.
The classification model obtained by SVM is a hyperplane that maximizes the width of the margin between the classes while minimizing the margin of errors [11,12]. Convex optimization theory is applied in the algorithm to train and find a hyperplane [13]. This training process, in its geometric interpretation, is equivalent to finding the nearest points between convex hulls in Hilbert space [14,15]. The aforementioned research shows that the position of a sample relative to a convex hull (the geometric interpretation of SVM) can play an important role in classification, specifically for identifying the relationship between training samples and SVM classification results. A convex theory-based remote sensing sample selection algorithm (CTRSSSA) for support vector machine classifiers is developed in this work. A Landsat-5 Thematic Mapper image acquired on August 31, 2009 (orbit number 113/27) is adopted in our experiments. The study area's land cover/use was divided into five categories. Using the region of interest tool, we selected samples from the image of the study area, with each category consisting of 1000 independent pixels. Results show that in most cases, our method achieves higher classification accuracy than the random sample selection method.

II. SUPPORT VECTOR MACHINE AND ITS GEOMETRIC INTERPRETATION

The support vector machine (SVM) is a supervised classifier which aims to find hyperplanes that separate the dataset with a maximum margin [12]. Given a set of labeled data (x_1, y_1), ..., (x_n, y_n), where x_i is a multidimensional sample vector and y_i ∈ {−1, +1} is the class label, the optimization problem associated with the SVM algorithm can be written in its dual form as follows [13]:

\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j), \quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le B.    (1)

This is a convex quadratic programming problem: B = ∞ gives the hard-margin case and B < ∞ the soft-margin case. K(x_i, x_j) is the kernel function, and the number of entries of the kernel matrix K = [K(x_i, x_j)] is equal to the square of the number of training samples; thus SVM needs more time and computer memory to train a model as the number of samples increases. The process of finding the hyperplane in SVM training is equivalent to finding the nearest points between convex hulls or reduced convex hulls [14,15].
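The quadratic growth of the kernel matrix can be illustrated with a short sketch (a minimal NumPy example with hypothetical sizes, not from the paper):

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=0.5):
    """Full RBF kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    return np.exp(-gamma * np.maximum(d2, 0.0))      # clamp tiny negative round-off

X = np.random.rand(200, 6)       # e.g. 200 training pixels, 6 spectral bands
K = rbf_kernel_matrix(X)
print(K.shape)                   # (200, 200): storage grows quadratically with n
```

Doubling the number of training samples quadruples the kernel matrix, which is why pruning redundant samples pays off directly in training time and memory.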

A. Convex theory and distance in Hilbert space
As mentioned above, the position of a sample relative to a convex hull can be used to determine the sample's importance for SVM training. Here, convex theory is introduced to find a sample set's convex hull, based on which any sample's distance from the convex center in Hilbert space can be calculated. This distance can serve as a criterion for sample evaluation and selection.
Definition 1: The set C ⊆ R^n is convex if [15]

\lambda x_1 + (1 - \lambda) x_2 \in C, \quad \forall x_1, x_2 \in C, \ \lambda \in [0, 1].    (2)

Proposition 1: Let C be a nonempty closed convex subset of R^n, and let z be a vector in R^n. There exists a unique vector in C that minimizes the distance to z:

x^{*} = \arg\min_{x \in C} \| z - x \|.    (3)

For a sample set S with convex hull convS = \{ \sum_i \alpha_i x_i : x_i \in S, \ \alpha_i \ge 0, \ \sum_i \alpha_i = 1 \}, the projection distance of a vector x to convS is zero if x lies inside the hull:

\min_{\alpha} \| x - \sum_i \alpha_i x_i \| = 0, \quad x \in \mathrm{conv}S,    (4)

and we can find a unique projection vector in convS with a nonzero distance if x lies outside the hull:

\min_{\alpha} \| x - \sum_i \alpha_i x_i \| > 0, \quad x \notin \mathrm{conv}S.    (5)

The norm or distance used in the formulas above is usually expressed through an inner product. In general, complex real-world applications require more expressive hypothesis spaces than a linear product. Kernel representations offer an alternative solution by projecting the data into a high-dimensional feature space to increase the computational power of SVM. A kernel K(x, z) and a feature map φ into a feature space F satisfy:

K(x, z) = \langle \varphi(x), \varphi(z) \rangle.    (6)

For linearly separable problems, the kernel function can be expressed directly as the inner product of two vectors, K(x, z) = \langle x, z \rangle; for linearly inseparable problems, SVM adopts a non-linear kernel function (such as the RBF kernel) to map a linearly inseparable problem into a linearly separable one in Hilbert space. The distance between two vectors in Hilbert space can then be represented as follows [16]:

d(x, z)^2 = \| \varphi(x) - \varphi(z) \|^2 = K(x, x) - 2 K(x, z) + K(z, z).    (7)

From equations (4), (5) and (7), the distance of a vector z to a convex hull convS can be represented as a projection:

d(z, \mathrm{conv}S)^2 = \min_{\alpha_i \ge 0, \ \sum_i \alpha_i = 1} \Big[ K(z, z) - 2 \sum_i \alpha_i K(z, x_i) + \sum_i \sum_j \alpha_i \alpha_j K(x_i, x_j) \Big].    (8)

If we can find a group of α_i that makes the projection distance between z and convS equal to zero, the vector z is inside convS; if the projection distance is not zero, the vector z is outside convS. This formula can be used as an important criterion to construct the convex hull. The center of mass of the convex set in Hilbert space can be represented as [16]:

c = \frac{1}{m} \sum_{i=1}^{m} \varphi(x_i).    (9)

The map φ(·) may be unknown for most kernel functions, so the center vector cannot generally be obtained directly; but when the kernel function is given, the distance of a vector x to a set's center can be obtained from formulas (7) and (9):

\mathrm{disCenter}(x, S)^2 = K(x, x) - \frac{2}{m} \sum_{i=1}^{m} K(x, x_i) + \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} K(x_i, x_j).    (10)

This formula can be used to evaluate the distance from a sample vector x to a set center in Hilbert space.
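The membership test behind equations (4), (5) and (8) can be sketched as a small quadratic program over the simplex. The sketch below (an illustration, not the paper's implementation) uses the linear kernel of the separable case and SciPy's SLSQP solver:

```python
import numpy as np
from scipy.optimize import minimize

def lin(a, b):
    """Linear kernel: plain inner product (the linearly separable case)."""
    return float(a @ b)

def dist2_to_hull(z, S, kernel=lin):
    """Squared distance from z to conv(S) as in eq. (8): minimize the quadratic
    form over the simplex {alpha_i >= 0, sum alpha_i = 1}."""
    m = len(S)
    G = np.array([[kernel(a, b) for b in S] for a in S])   # Gram matrix of S
    kz = np.array([kernel(z, s) for s in S])
    obj = lambda a: kernel(z, z) - 2.0 * a @ kz + a @ G @ a
    res = minimize(obj, np.full(m, 1.0 / m), bounds=[(0.0, 1.0)] * m,
                   constraints={"type": "eq", "fun": lambda a: a.sum() - 1.0},
                   method="SLSQP")
    return max(res.fun, 0.0)   # clamp numerical noise below zero

S = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # hull is a triangle
print(dist2_to_hull(np.array([0.2, 0.2]), S))   # ~0: point lies inside the hull
print(dist2_to_hull(np.array([3.0, 3.0]), S))   # ~12.5: nearest hull point is (0.5, 0.5)
```

A zero minimum certifies membership in the hull (eq. 4); a positive minimum gives the squared projection distance of eq. (5). Swapping `lin` for a non-linear kernel evaluates the same program in the feature space.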
Through formula (10), the distance of a vector x to the center of a convex hull convS can be represented as disCenter(x, convS), which is a measure of importance describing x's position relative to the convex hull.
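Because every term of eq. (10) is a kernel evaluation, disCenter can be computed without knowing φ explicitly. A minimal sketch (RBF kernel, illustrative data):

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    """RBF kernel between two vectors."""
    return float(np.exp(-gamma * np.sum((a - b) ** 2)))

def dis_center2(x, S, kernel=rbf):
    """Squared Hilbert-space distance from phi(x) to the center of mass of
    phi(S), eq. (10): K(x,x) - (2/m) sum_i K(x,x_i) + (1/m^2) sum_ij K(x_i,x_j)."""
    m = len(S)
    k_x = sum(kernel(x, s) for s in S) / m
    k_ss = sum(kernel(a, b) for a in S for b in S) / (m * m)
    return kernel(x, x) - 2.0 * k_x + k_ss

S = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]])  # a tight cluster
near, far = np.array([0.05, 0.05]), np.array([5.0, 5.0])
print(dis_center2(near, S) < dis_center2(far, S))   # True: near point is closer to the center
```

The constant double-sum term depends only on S, so in practice it is computed once per class and reused for every candidate sample.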

B. Algorithms based on convex theory
Based on convex theory and the distance formula above, we propose the following algorithms for constructing convex hulls and evaluating samples.

IV. EXPERIMENTS

Based on field experience and investigation of the study area, the study area's land cover/use categories are: Marsh Land (ML), Forestland (FL), Meadow (MD), Farmland (FD) and Water (WT). Through the Region Of Interest (ROI) tool, we selected samples from the study area image, with each category consisting of 1000 independent pixels; the 1000 samples of each category were further split into two sample sets: 200 samples as the training sample set and 800 samples as the testing sample set. The proposed algorithms are implemented in MATLAB R2011b, with LIBSVM 3.1 and its MATLAB interface adopted as the SVM classifier [17]. To evaluate the effectiveness of CTRSSSA for sample selection, the proposed method is compared with the random sample selection (RSS) method.
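The overall comparison loop can be sketched as follows. This is a heavily simplified stand-in, not the paper's experiment: the data are synthetic Gaussian clusters rather than the Landsat TM samples, scikit-learn's `SVC` (which wraps LIBSVM) replaces the MATLAB interface, and a simple disCenter ranking stands in for the paper's Sample_evaluation rule:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
GAMMA = 0.05

# Synthetic stand-in for five land-cover classes (NOT the Landsat TM data):
# 200 training pixels per class, 6 "spectral bands" each.
X = np.vstack([rng.normal(loc=2.0 * c, scale=0.8, size=(200, 6)) for c in range(5)])
y = np.repeat(np.arange(5), 200)

def class_scores(S, gamma=GAMMA):
    """disCenter^2 of each row of S to its class center in Hilbert space (eq. 10)."""
    d2 = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)
    G = np.exp(-gamma * d2)
    return 1.0 - 2.0 * G.mean(axis=1) + G.mean()

def select(X, y, p, ranked):
    """Keep a fraction p of each class: ranked by disCenter, or at random (RSS)."""
    keep = []
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        k = max(1, int(round(p * len(idx))))
        if ranked:
            order = np.argsort(class_scores(X[idx]))
            keep.extend(idx[order[:k]])             # closest to the class center first
        else:
            keep.extend(rng.choice(idx, size=k, replace=False))
    return np.array(keep)

for p in (1.0, 0.1):
    for ranked in (True, False):
        sel = select(X, y, p, ranked)
        acc = SVC(kernel="rbf", gamma=GAMMA).fit(X[sel], y[sel]).score(X, y)
        print(f"P={p:4.0%}  {'ranked' if ranked else 'random'}  accuracy={acc:.3f}")
```

The harness mirrors the experimental design: a per-class selection percentage P, one SVM trained per selected subset, and a single accuracy figure per (P, method) pair.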
In the experiment, a selection percentage P varying from 100% to 1% in steps of 1% is adopted; P = 100% means that the full training sample set is used. Nomenclature: selection percentage (S%), classification accuracy of the CTRSSSA method (C%) and of the random sample selection method (R%).
The classification accuracies of the two methods and their comparison are shown in Figures 3, 4 and 5. Along with the decrease in training samples, the classification accuracies of the two methods change in different patterns. For the CTRSSSA method, which shows a relatively stable and flat declining trend, the decline of the selection percentage P does not directly result in a decrease of classification accuracy (Fig. 3). To be specific, compared with the classification accuracy at P = 100%, an accuracy of 91.88% can still be achieved with a selection proportion P = 55%; the classification remains more than 90% accurate down to P = 22% (44 samples in each category); and 86.18% accuracy can still be reached at P = 2% (just 4 samples in each category). For the RSS method, however, a rapid decline in classification accuracy is seen (the classification accuracy drops below 90% at P = 88%). Moreover, the classification accuracy fluctuates remarkably, and the smaller the selection percentage P, the more obvious the fluctuation (Fig. 4). As we can see from Fig. 5, RSS's classification accuracy is only slightly higher (by at most 0.55%) than CTRSSSA's at P = 100, 99, 96, 95, 94, 93, 92, 69, 3 and 2; at all other percentages CTRSSSA has the advantage, which becomes more and more notable with the decline of P, reaching the maximum (87.95% − 70.7% = 17.25%) at P = 7%. At P = 7%, RSS is clearly inferior to CTRSSSA: the classification accuracy of RSS is just 70.7%, with many categories being misclassified, including Meadow, Farmland and even Water (Fig. 6.d), whereas the classification accuracy of CTRSSSA reaches 87.95%, with only part of the Farmland being misclassified as Forestland and some Marsh Land as Meadow (Fig. 6.c). Fig. 7 shows the evaluation results and the training value of each pixel on the remote sensing image by assigning the classified pixels a green or gray color.
Therein, the dark green color represents pixels positioned inside the corresponding category's convex hull, and the color depth from dark to light reflects the magnitude of the training value (i.e., the darker the color, the larger the training value). The gray color represents pixels outside the corresponding convex hull. CTRSSSA tends to select the darker green pixels, which are easily misclassified when sample sizes are small; selecting these pixels gives the SVM classifier higher accuracy when fewer samples are selected.
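The color coding described above can be sketched as a simple mapping from training value to pixel color (a hypothetical rendering rule for illustration, not the paper's actual figure code):

```python
def value_to_color(train_value, in_hull):
    """Map a training value in [0, 1] to an RGB triple: shades of green for
    pixels inside the class convex hull (darker = larger value), gray otherwise."""
    if not in_hull:
        return (0.6, 0.6, 0.6)            # gray: pixel outside the convex hull
    g = 1.0 - 0.7 * train_value           # darker green for larger training value
    return (0.0, round(g, 3), 0.0)

print(value_to_color(0.9, True))    # dark green: (0.0, 0.37, 0.0)
print(value_to_color(0.1, True))    # light green: (0.0, 0.93, 0.0)
print(value_to_color(0.5, False))   # gray: (0.6, 0.6, 0.6)
```

Applying such a mapping pixel by pixel yields an evaluation map like Fig. 7, where the selection behavior of the algorithm can be inspected visually.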

V. CONCLUSION
SVM is a widely used remote sensing image classifier that is data-dependent, and the quality of its training sample set greatly influences the classification result.
In this paper, convex theory is introduced into the selection process of remote sensing training samples to quantitatively describe the relationship between a sample and a convex hull, i.e., the sample's importance. Three algorithms, namely Is_in_convex, Get_convex_hull, and Sample_evaluation, are designed: Is_in_convex tests whether or not a multidimensional vector x is in a convex hull, and Get_convex_hull obtains the convex hull of a sample set. Our experiments, in which groups of samples from 100% to 1% are selected, demonstrate that in most cases, more valuable samples can be selected and higher classification accuracy can be achieved by CTRSSSA compared with the RSS method.
CTRSSSA is more stable than RSS, with a slower declining trend in classification accuracy as the sample selection percentage decreases. This holds even when the number of samples is rather small. Furthermore, the fluctuation of CTRSSSA is less severe than that of RSS. With the help of CTRSSSA, users can select fewer and more valuable samples when classifying a remote sensing image, increase the SVM training speed, and obtain better classification results.