
2011 IEEE International Workshop on Machine Learning for Signal Processing, September 18-21, 2011, Beijing, China


FAST MULTI-CLASS SAMPLE REDUCTION FOR SPEEDING UP SUPPORT VECTOR MACHINES

Jingnian Chen, Cheng-Lin Liu

National Laboratory of Pattern Recognition,

Institute of Automation, Chinese Academy of Sciences, 95 Zhongguancun East Road, Beijing 100190, P.R. China

{jnchen, liucl}@nlpr.ia.ac.cn

ABSTRACT

Despite the superior classification performance of support vector machines (SVMs), training SVMs on large datasets is still a challenging problem. Sample reduction methods have been proposed and shown to reduce the training complexity significantly, but they more or less trade off generalization performance. This paper presents an efficient sample reduction method for multi-class classification using one-vs-rest SVMs, called Multi-class Sample Selection (MUSS). For each binary one-vs-rest classification problem, positive and negative samples are selected based on their distances from the cluster centers of the positive class, under the assumption that positive samples far from the positive centers and negative samples close to the positive centers lie near the classification boundary. The purpose of the clustering is to improve the computational efficiency of sample selection, rather than to select samples from the cluster centers as previous methods did. Experiments on a wide variety of datasets demonstrate the superiority of the proposed MUSS over other competitive algorithms with respect to the tradeoff between reduced sample size and classification performance. The experimental results show that MUSS also works well for binary classification problems.

Index Terms—SVM, Multi-class classification, Sample selection, Clustering.

1. INTRODUCTION

Support vector machines (SVMs) [1] are a core method for classification. They enjoy strong theoretical foundations and have achieved excellent performance in a wide variety of applications, such as handwritten digit recognition [2], [3], face detection [4], and classification of web pages [5]. Unfortunately, the complexity of training SVMs on datasets with a large number of samples is very high. If the number of classes is also large, training SVMs becomes even more complex and difficult. Consider an ordinary Chinese character dataset consisting of 3755 classes of samples. For the two well-known styles of multi-class SVMs, one-versus-one (o-v-o) and one-versus-rest (o-v-r), the former requires training 7,048,135 SVM models, which is intractable in practice; the latter requires processing all training samples to train each of the 3755 models. Furthermore, the training set is typically imbalanced, which may seriously affect the performance of classifiers [6].

(This work was supported in part by the Hundred Talents Program of the Chinese Academy of Sciences (CAS) and the National Natural Science Foundation of China (NSFC) under Grant nos. 60775004 and 60825301.)

Many methods have been developed for speeding up the training of SVMs on large-scale datasets. Among them, sample selection is a very effective one. Since the training time of an SVM generally grows quadratically with the number of training samples, sample selection prior to training can significantly reduce the training time. Furthermore, the decision boundary of an SVM depends only on a subset of the training samples, the support vectors, which lie close to the class boundary [7]. Therefore, by selecting samples near the boundary as training data, we can reduce the size of the training set and speed up SVM training while retaining the support vectors and preserving accuracy.

Due to the effectiveness of sample selection in speeding up SVM training, several methods have been proposed for this problem. One straightforward approach is random down-sampling [8]. This method is simple and efficient; however, it does not take any characteristic of the data into account, so it may not select the most effective training subset.

Some sample selection algorithms for SVM are based on clustering. Lyhyaoui et al. [9] performed clustering on each class and then searched the opposite class for the samples nearest to each cluster center, so as to obtain samples near the decision boundary. This assumes that classes do not overlap, which is impractical. Subsequently, several other clustering-based methods were proposed [10], [11]. Naturally, the performance of these methods depends on the clustering result, which can be unstable.

In recent years several effective approaches for selecting boundary samples have been proposed. Panda et al. [12] proposed the Concept Boundary Detection (CBD) algorithm, which consists of two steps: concept-independent preprocessing and concept-specific sampling. The first step identifies the neighbors of each sample, and the second step determines boundary samples for each concept by computing a score for each sample. CBD performs well; however, the score of a sample may be dominated by noisy data very close to it.

Meanwhile, Shin and Cho [13] proposed the Neighborhood Property based Pattern Selection algorithm (NPPS). The main property used in NPPS is that a boundary sample tends to have more neighbors from different classes, so the concept of Neighbors Entropy is defined to measure the heterogeneity of a sample's neighbors. NPPS makes full use of neighborhood properties; its performance, however, is highly dependent on the number of neighbors.

Recently, Angiulli and Astorino [14] applied the Fast Condensed Nearest Neighbor rule (FCNN) [15] to select samples for SVM. With this rule, a training-set-consistent subset is obtained and used to train the SVMs. This method avoids having to choose the number of neighbors and can sharply reduce the number of samples, but it often decreases accuracy as well.

This paper presents a fast sample selection algorithm, Multi-class Sample Selection (MUSS), which selects samples from massive multi-class datasets for efficient SVM training. For each o-v-r SVM model, MUSS selects samples near the boundary prior to training. The idea of MUSS is simple. For a given positive class c, MUSS first performs clustering on class c to obtain a certain number of cluster centers. It then uses these centers as reference points for sample selection: positive samples that are close to one of the centers are deleted, and negative samples that are near one of the centers are selected. Comparing MUSS with competitive sample selection algorithms, we found empirically that it is clearly superior.

The rest of this paper is organized as follows. Section 2 gives details of MUSS algorithm. Section 3 presents experiments. Finally, Section 4 concludes this research with future work.

2. MULTI-CLASS SAMPLE SELECTION

This section describes the details of the proposed MUSS. First, its main framework is described. Then, a strategy for determining the parameters of MUSS is presented. Finally, the complexity of MUSS is discussed.

2.1. MUSS algorithm

Given a training set T with L classes and N samples, for a certain class c, 1 ≤ c ≤ L, samples from class c are treated as positive, while those from the other classes are negative. For the current positive class c, MUSS obtains the selection result, the set Sc of selected samples, as follows.

First, a clustering algorithm (e.g., k-means) is performed on class c. Suppose the k cluster centers obtained are M1, M2, ..., Mk; they will be used as reference points in the selection process. Then, the distance d(Mi, x) between each center Mi and every training sample x is computed. Using each cluster center Mi as a reference point, we delete positive samples and select negative samples that are close to Mi (i.e., that have small distances). The rationale behind this operation is that a positive sample near one of the cluster centers is, with high probability, an interior point, whereas a negative sample near the boundary is close to one of the cluster centers of the positive class. In this way, we can delete redundant positive samples and select boundary negative samples.

It should be pointed out that although MUSS uses clustering, there are some notable differences between our method and other clustering-based sample selection algorithms, summarized as follows.

Most previous selection algorithms perform clustering on the whole training set, whereas MUSS clusters only the positive class, which is much smaller than the whole training set of a multi-class dataset. MUSS is therefore more efficient.

The purpose of clustering in most clustering-based algorithms is to delete or condense samples, whereas MUSS performs clustering to obtain reference points for selecting boundary negative samples and deleting interior positive samples.

To ensure the validity of MUSS, two parameters need to be determined: the ratio of selected samples and the number of clusters. Details are discussed in the following subsection.

2.2. Determination of the parameters

We give a simple strategy here for the determination of the ratio of selected samples and the number of clusters.

The presented strategy allows users to determine the ratio according to the available computational resources and the size of the training set. Generally, a smaller training set needs a higher ratio of selected samples, while a lower ratio is usually sufficient for a larger training set. With this in mind, we first determine the range of the ratio; for example, for a medium-scale training set the ratio may be chosen from {0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4}. A proper value within the range is then decided by experimental validation. Before doing so, several related issues need to be considered; they are discussed below.

Suppose class c (1 ≤ c ≤ L) is the current positive class and contains Nc samples. Two aspects need to be considered to obtain a suitable value of the ratio r: (1) the relation between the number of selected positive samples nc and the number of selected negative samples nn; (2) the way of selecting negative samples: should they be selected from the whole negative class or from each class l (1 ≤ l ≤ L, l ≠ c)?

For the first aspect, since the negative samples come from more than one class, the negative class boundary is usually more complex and more samples are needed to determine it; therefore, nn should be greater than nc. Meanwhile, the possible serious imbalance between class c and the other classes should also be controlled. Both goals can be achieved easily by the rule adopted in Algorithm MUSS:

If Nc > rN/3 + k, then nc = rN/3, and thus nn = r·N − nc = 2rN/3; otherwise, all positive samples are selected, so nc = Nc and nn = r·N − Nc.

With this rule, not only can the negative class boundary be adequately determined, but the possible serious imbalance of training data caused by the o-v-r style can also be controlled.
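As a concrete illustration of this rule, the budget computation can be written in a few lines of Python (a sketch only; the function name and the example numbers below are ours and are chosen purely for illustration, not taken from the paper):

# Sketch of the positive/negative sample budgets described above.
# N: total number of training samples, N_c: size of positive class c,
# r: selection ratio, k: number of cluster centers.
def sample_budgets(N, N_c, r, k):
    if N_c > r * N / 3 + k:       # enough positives: cap them at r*N/3
        n_c = r * N / 3
    else:                         # small positive class: keep all of it
        n_c = N_c
    n_n = r * N - n_c             # the remaining budget goes to the negatives
    return n_c, n_n

# Example (illustrative values only): N = 3823, N_c = 380, r = 0.25, k = 7
# gives n_c ≈ 318.6 and n_n ≈ 637.2, i.e. roughly twice as many negatives
# as positives, as intended.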

For the second aspect, due to the complexity of the negative class boundary analyzed above, we select samples from each class l (l ≠ c) in order to determine this boundary accurately. In this way, the information contained in each class can be used. In addition, to avoid the dominance of larger classes, we select the same number of samples from each class l (l ≠ c).

As for the parameter k, the number of clusters of class c, we found that the performance of MUSS is not sensitive to k when the size of class c is not very small (see Section 3.4). We therefore set this parameter according to the size of class c. For example, if Nc is larger than 140, k is set to 7; if Nc is between 70 and 140, k is set to 5; if Nc is between 30 and 70, k is set to 3; otherwise, k is set to 1. This is the setting used in our experiments.
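A minimal Python helper implementing this size-based choice of k might look as follows (a sketch; the behavior at exactly 30, 70 and 140 samples is not specified in the text, so the comparisons below reflect our own reading):

def choose_k(N_c):
    # Number of clusters for a positive class of size N_c,
    # following the size-based rule described above.
    if N_c > 140:
        return 7
    elif N_c > 70:
        return 5
    elif N_c > 30:
        return 3
    else:
        return 1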

After the above parameters are determined, we can select a value of the ratio from a range by experimental validation. The pseudocode of MUSS is given below.

Algorithm MUSS
Input: training set T with L classes and N samples; the ratio of selected samples r.
Output: Sc, the set of selected samples when class c (1 ≤ c ≤ L) is treated as positive and the remaining classes as negative.
Procedure:
1. for each class c, 1 ≤ c ≤ L
2.   Sc ← {x | x is from class c};
3.   perform k-means clustering on class c and get cluster centers M1, M2, ..., Mk;
4.   for each center Mi
5.     compute d(Mi, x) between Mi and each sample x;
6.     if Nc ≥ r·N/3 + k
7.       find the ñc samples in Sc closest to Mi and delete them from Sc, where ñc = (Nc − r·N/3)/k;
8.     end if
9.     for each class l ≠ c, 1 ≤ l ≤ L
10.      find the nl samples of class l with the smallest distances to Mi and add them to Sc, where nl = (r·N − nc)/(k·(L − 1)) and nc = r·N/3 if Nc ≥ r·N/3 + k, nc = Nc otherwise;
11.    end for
12.  end for
13. end for
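The selection procedure for a single o-v-r problem can be sketched in Python roughly as follows. This is an unofficial, simplified reading of the pseudocode above: it uses scikit-learn's KMeans and NumPy, and the rounding of the per-center budgets is our own choice, since the paper does not specify it.

import numpy as np
from sklearn.cluster import KMeans

def muss_select(X, y, c, r, k):
    # Select a training subset for the o-v-r problem "class c vs. rest".
    N = len(y)
    classes = np.unique(y)
    L = len(classes)
    pos_idx = np.where(y == c)[0]
    N_c = len(pos_idx)

    # Positive/negative budgets (rule of Section 2.2).
    n_c = r * N / 3 if N_c > r * N / 3 + k else N_c
    n_l = int(round((r * N - n_c) / (k * (L - 1))))  # negatives per class per center
    n_del = int(round((N_c - n_c) / k))              # positives deleted per center

    # Cluster only the positive class to obtain k reference centers.
    centers = KMeans(n_clusters=k, n_init=10).fit(X[pos_idx]).cluster_centers_
    selected = set(pos_idx.tolist())                 # start with all positives

    for M in centers:
        d = np.linalg.norm(X - M, axis=1)            # distances of all samples to M

        # Delete the remaining positives closest to this center (interior points).
        if N_c >= r * N / 3 + k and n_del > 0:
            still_pos = np.array([i for i in pos_idx.tolist() if i in selected])
            selected -= set(still_pos[np.argsort(d[still_pos])][:n_del].tolist())

        # From every negative class, keep the samples closest to this center
        # (likely boundary points).
        for l in classes:
            if l == c:
                continue
            neg_idx = np.where(y == l)[0]
            selected |= set(neg_idx[np.argsort(d[neg_idx])][:n_l].tolist())

    return np.array(sorted(selected))

For instance, S_c = muss_select(X_trn, y_trn, c, r=0.25, k=7) would return the indices on which the o-v-r SVM for class c is trained.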

2.3. Complexity of MUSS

To conclude Section 2, we discuss the complexity of MUSS. The complexity of the clustering, performed only on a single class, is O(kNT/L), where N is the size of the training set, L is the number of classes, k is the number of cluster centers and T is the number of iterations of the clustering algorithm. When k is set to 1, the complexity reduces to O(N/L). The cost of computing the distance d(Mi, x) between each cluster center Mi and every sample x is O(kN). The remaining main operation of MUSS (steps 7 and 10) is to sort the samples of each class by distance in ascending order, with complexity O(kN log(N/L)). Thus, MUSS is very efficient.

3. EXPERIMENTS

To demonstrate the effectiveness of the proposed MUSS, we compare it with the three competitive algorithms described in Section 1: CBD, NPPS and FCNN. The aspects evaluated in our experiments are (1) the ability to keep classification accuracy; (2) the ratio of selected samples; (3) the reduction of SVM training time; (4) the sample selection time; and (5) the effect of parameters on MUSS.

3.1. Experimental datasets

Experiments were performed on six small-scale datasets and six medium or large-scale multi-class datasets.

Information about these datasets is given in Table 1, with the first six being small and the rest being medium or large scale. For each dataset, five aspects are listed: the size of the dataset (Size), the number of attributes (#Att.), the number of classes (#Cls), the size of the training subset (#Trn.) and the size of the test subset (#Tes.).

The small-scale data sets are all from the UCI machine learning repository. We randomly divided each of them (with random seed 1) into training and test subsets with a size ratio of 4:1. Each of the medium or large-scale datasets was originally divided into a training set and a test set. Among them, Isolet, Letter, Optdigits and Pendigit are from the UCI repository. USPS (US Postal Service) consists of handwritten digit images, each represented by 256 features. HCL2000 is a Chinese character set collected by the Laboratory of Pattern Recognition and Intelligent Systems, Beijing University of Posts and Telecommunications [16]. It includes 3755 classes, each containing 1000 samples, with 700 for training and 300 for testing. As training and testing SVMs on the whole of HCL2000 would take too long, we use only the samples of its first 100 classes in our experiments. 8-direction gradient features [17] are extracted, yielding 512 features for this dataset.
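For the 4:1 split, one could use, for instance, scikit-learn (a sketch only; the authors' own splitting code is not available, and random_state=1 merely mirrors the description of "random seed 1", it will not reproduce the exact partition):

from sklearn.model_selection import train_test_split

# 4:1 train/test split of a small-scale UCI dataset (X, y assumed loaded).
X_trn, X_tes, y_trn, y_tes = train_test_split(
    X, y, test_size=0.2, random_state=1)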


Table 1. Details of the experimental data sets

SN  Data set     Size    #Att.  #Cls  #Trn.  #Tes.
1   Dermatology  358     34     6     282    76
2   Diabetes     768     8      2     613    155
3   Glass        214     9      6     168    46
4   Ionosphere   351     34     2     279    72
5   Iris         150     4      3     117    33
6   Zoo          101     17     7     76     25
7   HCL2000      100000  512    100   70000  30000
8   Isolet       7797    617    26    6238   1559
9   Letter       20000   16     26    16000  4000
10  Optdigits    5620    64     10    3823   1797
11  Pendigit     10992   16     10    7494   3498
12  USPS         9298    256    10    7291   2007

3.2. Setting of experiments

All experiments were performed on an Intel(R) Core(TM) 2 Quad CPU running at 2.83 GHz with 3.25 GB RAM. SVM training was performed with the SOR algorithm [18], coded in VC++ 6.0.

For the high-dimensional datasets HCL2000, Isolet and USPS, dimensionality reduction is applied: the dimension of HCL2000 is reduced by LDA to 99 (one less than the number of classes used), and those of Isolet and USPS are reduced by PCA to 150 and 80, respectively.
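With scikit-learn, this preprocessing could look roughly as follows (a sketch; the array names are placeholders and the exact LDA/PCA implementations used by the authors are not stated):

from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# HCL2000 (100 classes): LDA allows at most n_classes - 1 = 99 components.
lda = LinearDiscriminantAnalysis(n_components=99)
X_hcl_red = lda.fit_transform(X_hcl_trn, y_hcl_trn)

# Isolet and USPS: unsupervised PCA to 150 and 80 dimensions, respectively.
X_isolet_red = PCA(n_components=150).fit_transform(X_isolet_trn)
X_usps_red = PCA(n_components=80).fit_transform(X_usps_trn)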

In all experiments the Gaussian kernel in (1) is used:

k(xi, xj) = exp(−||xi − xj||^2 / (2σ^2)).    (1)
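For reference, the kernel of Eq. (1) can be computed as follows (a sketch; note that if an off-the-shelf RBF implementation parameterized as exp(−γ||xi − xj||^2) is used instead, setting γ = 1/(2σ^2) matches Eq. (1)):

import numpy as np

def gaussian_kernel(xi, xj, sigma2):
    # Gaussian kernel of Eq. (1); sigma2 is the squared width parameter σ².
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma2))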

On each data set, the optimal settings of σ^2 and the error tolerance C found for SVMs trained on the whole data set were also used to train SVMs on the subsets selected by each of the compared algorithms. The reason is that a good sample selection algorithm should not noticeably change the original decision boundary, so the related parameters should be kept unchanged. Moreover, this rule is convenient in real applications.

For CBD, the number of nearest neighbors is set to 100, as done in [12]. Another parameter is the ratio of selected samples, which is also needed by MUSS; it is selected from a range according to the scale of the data set. For the small data sets in Table 1, the ratio is selected from 0.1 to 0.7; for the medium or large datasets except HCL2000, from 0.1 to 0.4; and for HCL2000, from 0.01 to 0.07. For NPPS, the number of nearest neighbors kn needs to be determined. All these parameters are decided for each data set on the basis of experimental validation. As FCNN uses the nearest neighbor rule, it needs no extra parameter. In addition, as FCNN1 generally achieves higher accuracy than FCNN2 [15], we adopt FCNN1. The settings of σ^2, C and kn for each data set are reported in Table 2.
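As an illustration of how a selected subset would be used (a sketch only; the authors train with their own SOR-based solver rather than scikit-learn, and the o-v-r loop and variable names here are ours, reusing the muss_select and choose_k sketches above), an o-v-r SVM with the Gaussian kernel of Eq. (1) can be trained on the MUSS subset as follows, converting σ^2 to scikit-learn's gamma via gamma = 1/(2σ^2):

from sklearn.svm import SVC

sigma2, C = 0.5, 5            # e.g. the Optdigits setting from Table 2
models = {}
for c in range(n_classes):    # n_classes assumed known; labels assumed 0..n_classes-1
    S_c = muss_select(X_trn, y_trn, c, r=0.25, k=choose_k((y_trn == c).sum()))
    clf = SVC(kernel='rbf', gamma=1.0 / (2.0 * sigma2), C=C)
    clf.fit(X_trn[S_c], (y_trn[S_c] == c).astype(int))   # class c vs. rest
    models[c] = clf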

Table 2. Configurations of parameters

Data set     σ²    C    kn      Data set   σ²    C    kn
Dermatology  1     20   8       HCL2000    1     10   50
Diabetes     2     5    5       Isolet     4     10   9
Glass        0.1   0.1  6       Letter     0.15  40   15
Ionosphere   1     10   12      Optdigits  0.5   5    50
Iris         1     5    15      Pendigit   1     15   55
Zoo          1     1    17      USPS       1     7    50

Table 3. Results on small-scale data sets

Algorithm         Dermato  Diabetes  Glass  Ionosphe  Iris   Zoo
All    Hi-acc     100      78.06     69.57  91.67     100    100
       Tra-t      0.20     0.47      0.17   0.11      0.01   0.03
CBD    Hi-acc     98.68    74.19     52.17  90.28     96.97  96.00
       Ratio      0.50     0.70      0.35   0.60      0.55   0.70
       Tra-t      0.17     0.34      0.03   0.09      0.03   0.03
FCNN   Hi-acc     92.11    75.48     56.52  86.11     90.91  100
       Ratio      0.33     0.54      0.45   0.18      0.15   0.17
       Tra-t      0.11     0.19      0.03   0.01      0.03   0.01
MUSS   Hi-acc     98.68    78.71     65.22  94.44     100    100
       Ratio      0.44     0.65      0.52   0.55      0.43   0.20
       Tra-t      0.08     0.28      0.08   0.06      0.01   0.02
NPPS   Hi-acc     94.74    76.13     58.70  91.67     63.64  100
       Ratio      0.55     0.63      0.63   0.24      0.32   0.64
       Tra-t      0.14     0.20      0.08   0.03      0.01   0.03

Table 4. Results on medium or large-scale data sets

Algorithm          HCL2000   Isolet  Letter  Optdigit  Pendigit  USPS
All    Hi-acc      99.29     96.34   97.97   98.94     98.80     95.81
       Tra-t       25173.94  290.97  2139.5  65.03     36.22     197.78
CBD    Hi-acc      99.19     95.64   97.90   98.83     98.74     95.52
       Ratio       0.07      0.40    0.40    0.40      0.40      0.30
       Tra-t       2255.36   107.81  827.38  25.17     17.67     46.22
       Sele-t      48.30     0.97    5.69    0.19      0.49      0.72
FCNN   Hi-acc      99.02     95.77   95.67   97.61     96.80     94.67
       Ratio       0.07      0.30    0.18    0.09      0.05      0.12
       Tra-t       7616.83   76.80   228.06  4.23      2.06      14.53
       Sele-t      6869.90   22.19   54.28   2.86      1.47      4.55
MUSS   Hi-acc      99.30     96.09   97.40   99.05     98.46     95.62
       Ratio       0.07      0.40    0.35    0.25      0.40      0.34
       Tra-t       1456.25   89.99   596.42  9.75      12.73     43.42
       Sele-t      60.98     1.97    0.87    0.17      0.16      0.59
NPPS   Hi-acc      98.28     94.74   86.42   97.83     98.34     95.57
       Ratio       0.17      0.55    0.47    0.41      0.29      0.47
       Tra-t       3378.56   116.86  713.03  25.22     12.36     78.49
       Sele-t      1087.50   5.88    18.42   6.50      5.05      11.52

Figure 1. Accuracy of the compared methods on all experimental datasets.

Figure 2. Selection ratio of the compared methods on all experimental datasets.


3.3. Experimental results

We report the results of our experiments on the small-scale datasets in Table 3 and those on the medium or large-scale datasets in Table 4. Table 3 lists the highest accuracy, the corresponding ratio of selected samples and the SVM training time obtained with each compared algorithm on each small dataset. Besides these three aspects, Table 4 also gives the sample selection time on each medium or large dataset. As the selection time on the small sets is negligible, Table 3 omits it. Here, "Hi-acc", "Tra-t", "Sele-t" and "All" denote the highest accuracy, the training time, the sample selection time and learning SVMs with all training samples, respectively. Furthermore, to give an overview of accuracy and selection ratio, Figures 1 and 2 depict these two aspects on all experimental datasets.

From Figure 1 and Tables 3 and 4 it can be seen that, on most datasets, MUSS is the best of the compared sample selection algorithms at preserving classification accuracy. On Diabetes, HCL2000, Ionosphere and Optdigit, the accuracy is even improved by MUSS. CBD follows MUSS in this respect and obtains better results than FCNN and NPPS, although its accuracy on Glass is markedly lower. By contrast, FCNN does not preserve classification accuracy well, especially on Dermatology, Glass and Iris. NPPS also often reduces accuracy, especially on Dermatology, Glass, Iris and Letter.

From the selection ratios and training times shown in Figure 2 and Tables 3 and 4, we can see that all the compared sample selection algorithms remarkably reduce the number of training samples and the training time. Among them, FCNN performs best in these two aspects on most datasets, but it often causes an obvious loss of accuracy. Our MUSS algorithm follows FCNN and is clearly superior to CBD and NPPS in reducing training data and training time.

The sample selection times listed in Table 4 show that all the sample selection algorithms run fast. By comparison, MUSS runs the fastest, closely followed by CBD; FCNN and NPPS run noticeably slower. This also shows that FCNN reduces the training time the most mainly because of its lower sample selection ratio.

Considering all of the above comparisons, MUSS keeps or even improves accuracy on most data sets, while the other algorithms do not; in particular, FCNN and NPPS often cause an obvious loss of accuracy. As a whole, MUSS performs best on the experimental data sets.

3.4. Effects of parameters on MUSS

To observe the effects of the ratio of selected samples r and the number of clusters k on MUSS, we perform experiments on a typical data set, Optdigit. Table 5 lists the accuracy and training time on Optdigit when the number of clusters k is 7 and the selection ratio r varies from 0.1 to 0.5, and Table 6 lists the accuracy, training time and selection time when r is 0.25 and k varies from 1 to 9.

Table 5. Results of MUSS on Optdigit when k is 7 and r varies

Ratio r    0.1    0.15   0.2    0.25   0.3    0.35   0.4    0.45   0.5
Accuracy   97.38  98.61  99.00  99.05  98.94  98.78  98.83  98.89  98.94
Tra-time   3.09   5.05   7.63   9.75   13.11  15.80  18.23  21.08  25.25

Table 6. Results of MUSS on Optdigit when r is 0.25 and k varies

k          1      2      3      4      5      6      7      8      9
Accuracy   98.89  98.83  98.66  98.94  98.94  99.05  99.05  98.94  98.94
Tra-time   8.91   8.83   8.44   9.08   9.83   10.06  9.75   9.41   9.63
Sel-time   0.062  0.030  0.091  0.141  0.159  0.157  0.174  0.264  0.218

From Table 5 we can see that the training time increases with the ratio r. The accuracy, however, does not always increase, although its variation is small; it reaches its maximum when r is 0.25. In our experiments we found that when k takes a value from a candidate set predetermined according to the scale of the training set, MUSS achieves very good results, so k can easily be determined by experimental validation.

The results in Table 6 show that as the number of clusters k increases, the sample selection time shows an approximately increasing trend, while the classification accuracy changes smoothly and reaches its maximum when k is 6 or 7. Hence the performance of MUSS is not sensitive to the parameter k, especially when the size of the training set is not very small, and k can be decided easily. In our experiments we set it according to the scale of the training set, as described in Section 2.2.

4. CONCLUSION

This paper presents a sample selection method, MUSS, which selects boundary samples as training data to substantially reduce the scale of a multi-class dataset and effectively speed up the training of SVM models. With the one-versus-rest (o-v-r) style of multi-class SVMs, MUSS obtains a set of selected samples for each o-v-r SVM model. For a given positive class c, MUSS obtains a predetermined number of cluster centers of class c by a clustering algorithm and then, using these centers as reference points, sharply and efficiently reduces the scale of the training set. Meanwhile, the possible serious imbalance of training data caused by the o-v-r style can be controlled. Experiments performed on a wide variety of datasets show that MUSS outperforms other competitive algorithms in preserving classification accuracy. On most data sets it keeps or even improves classification accuracy while sharply reducing the scale of the data and remarkably speeding up SVM training. The experiments also show that MUSS performs well on binary data sets, although it is oriented toward multi-class datasets. Furthermore, the parameters of the MUSS algorithm can be easily determined with the presented strategy.


It should be pointed out that, like most other sample selection algorithms, MUSS mainly accelerates the training process of SVMs, although it can additionally speed up classification owing to the reduction of the number of support vectors. In future work, we will combine MUSS with other techniques to improve the classification efficiency of SVMs and apply it to massive practical problems with large numbers of classes.

REFERENCES

[1] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.

[2] D. DeCoste, B. Schölkopf, N. Cristianini, "Training invariant support vector machines," Machine Learning, vol. 46, no. 1-3, pp. 161-190, 2002.

[3] J. Dong, A. Krzyzak, C. Y. Suen, "Fast SVM training algorithm with decomposition on very large data sets," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 4, pp. 603-618, 2005.

[4] E. Osuna, R. Freund, F. Girosi, "Training support vector machines: An application to face detection," in Proc. CVPR'97, 1997, pp. 130-136.

[5] T. Joachims, "Text categorization with support vector machines: learning with many relevant features," in Proc. of 10th European Conf. on Machine Learning, 1998, pp. 137-142.

[6] N. V. Chawla, N. Japkowicz, A. Kotcz, "Editorial to the special issue on learning from imbalanced data sets," ACM SIGKDD Explorations, vol. 6, no. 1, pp. 1-6, 2004.

[7] C. Burges, "Geometry and invariance in kernel based methods," in Advances in Kernel Methods: Support Vector Learning, MIT Press, 1999, pp. 89-116.

[8] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, pp. 123-140, 1996.

[9] A. Lyhyaoui, M. Martínez, I. Mora, et al., "Sample selection via clustering to construct support vector-like classifiers," IEEE Transactions on Neural Networks, vol. 10, no. 6, pp. 1474-1481, 1999.

[10] M. B. Almeida, A. P. Braga, J. P. Braga, "SVM-KM: Speeding SVMs learning with a priori cluster selection and k-means," in Proc. of 6th Brazilian Symposium on Neural Networks, 2000, pp. 162-167.

[11] R. Koggalage, S. Halgamuge, "Reducing the number of training samples for fast support vector machine classification," Neural Information Processing Letters and Reviews, vol. 2, no. 3, pp. 57-65, 2004.

[12] N. Panda, E. Y. Chang, G. Wu, "Concept boundary detection for speeding up SVM," in Proc. of International Conference on Machine Learning, 2006, pp. 681-688.

[13] H. Shin, S. Cho, "Neighborhood property based pattern selection for support vector machines," Neural Computation, vol. 19, no. 3, pp. 816-855, 2007.

[14] F. Angiulli, A. Astorino, "Scaling up support vector machines using nearest neighbor condensation," IEEE Transactions on Neural Networks, vol. 21, no. 2, pp. 351-357, 2010.

[15] F. Angiulli, "Fast nearest neighbor condensation for large data sets classification," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 11, pp. 1450-1464, 2007.

[16] H. Zhang, J. Guo, G. Chen, C. Li, "HCL2000—A large-scale handwritten Chinese character database for handwritten character recognition," in Proc. of 10th International Conference on Document Analysis and Recognition, 2009, pp. 286-289.

[17] C.-L. Liu, K. Nakashima, H. Sako, H. Fujisawa, "Handwritten digit recognition: investigation of normalization and feature extraction techniques," Pattern Recognition, vol. 37, no. 2, pp. 265-279, 2004.

[18] O. L. Mangasarian, D. R. Musicant, "Successive overrelaxation for support vector machines," IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1032-1037, 1999.