Next: 6. Conclusion Up: Lagrangian Support Vector Machines Previous: 4. LSVM for Nonlinear

5. Numerical Implementation and Comparisons

The implementation of LSVM is straightforward, as shown in the previous sections. Our first ``dry run" experiments were on randomly generated problems just to test the speed and effectiveness of LSVM on large problems. We first used a Pentium III 500 MHz notebook with 384 megabytes of memory (and additional swap space) on 2 million randomly generated points in $R^{10}$ with $\nu=\frac{1}{m}$ and $\alpha=\frac{1.9}{\nu}$ . LSVM solved the problem in 6 iterations in 81.52 minutes to an optimality criterion of on a 2-norm violation of (14). The same problem was solved in the same number of iterations and to the same accuracy in minutes on a 250 MHz UltraSPARC II processor with 2 gigabytes of memory. After these preliminary encouraging tests we proceeded to more systematic numerical tests as follows.

Most of the rest of our experiments were run on the Carleton College workstation ``gray'', which utilizes a 700 MHz Pentium III Xeon processor and a maximum of 2 Gigabytes of memory available for each process. This computer runs Redhat Linux 6.2, with MATLAB 6.0. The one set of experiments showing the checkerboard were run on the UW-Madison Data Mining Institute ``locop2'' machine, which utilizes a 400 MHz Pentium II Xeon and a maximum of 2 Gigabytes of memory available for each process. This computer runs Windows NT Server 4.0, with MATLAB 5.3.1. Both gray and locop2 are multiprocessor machines. However, only one processor was used for all the experiments shown here as MATLAB is a single threaded application and does not distribute any effort across processors [27]. We had exclusive access to these machines, so there were no issues with other users inconsistently slowing the computers down.

The first results we present are designed to show that our reformulation of the SVM (7) and its associated algorithm LSVM yield similar performance to the standard SVM (2), referred to here as SVM-QP. Results are also shown for our active set SVM (ASVM) algorithm [18]. For six datasets available from the UCI Machine Learning Repository [22], we performed tenfold cross validation in order to compare test set accuracies between the methodologies. Furthermore, we utilized a tuning set for each algorithm to find the optimal value of the parameter $\nu$ . For both LSVM and ASVM, we used an optimality tolerance of 0.001 to determine when to terminate. SVM-QP was implemented using the high-performing CPLEX barrier quadratic programming solver [9] with its default stopping criterion. Altering the CPLEX default stopping criterion to match that of LSVM did not result in significant change in timing relative to LSVM, but did reduce test set correctness for SVM-QP.

We also tested SVM-QP with the well-known optimized algorithm SVM $^{light}$ [10]. Joachims, the author of SVM $^{light}$ , graciously provided us with the newest version of the software (Version 3.10b) and advice on setting the parameters. All features for all experiments were normalized to the range as recommended in the SVM $^{light}$ documentation. Such normalization was used in all experiments that we present, and is recommended for the LSVM algorithm due to the penalty on the bias parameter $\gamma$ which we have added to our objective (7). We chose to use the default termination error criterion in SVM $^{light}$ of . We also present an estimate of how much memory each algorithm used, as well as the number of support vectors. The number of support vectors was determined here as the number of components of the dual vector that were nonzero. Moreover, we also present the number of components of larger than 0.001 (referred to in the table as ``SVs with tolerance'').

The results demonstrate that LSVM performs comparably to SVM-QP with respect to generalizability, and shows running times comparable to or better than SVM $^{light}$ . LSVM and ASVM show nearly identical generalization performance, as they solve the same optimization problem. Any differences in generalization observed between the two are only due to different approximate solutions being found at termination. LSVM and ASVM usually run in roughly the same amount of time, though there are some cases where ASVM is noticeably faster than LSVM. Nevertheless, this is impressive performance on the part of LSVM, which is a dramatically simpler algorithm than ASVM, CPLEX, and SVM $^{light}$ . LSVM utilizes significantly less memory than the SVM-QP algorithms as well, though it does require more support vectors than SVM-QP. We address this point in the conclusion.

**Table 1:** LSVM compared with conventional SVM-QP (CPLEX and SVM $^{light}$ ) and ASVM on six UCI datasets. LSVM test correctness is comparable to SVM-QP, with timing much faster than CPLEX and faster than or comparable to SVM $^{light}$ . An asterisk in the first three columns indicates that the LSVM results are significantly different from SVM $^{light}$ at significance level $\alpha =0.05$ . The column ``w/ tolerance'' indicates the number of support vectors with dual variable .
$\begin{table} \begin{center} \epsfxsize =5.7in\leavevmode\epsfbox {datasets.epsi}\end{center}\end{table}$

**Table 2:** Comparison of LSVM with SVM $^{light}$ on the UCI adult dataset. LSVM test correctness is comparable to that of SVM $^{light}$ , but is faster on large datasets. The column ``w/ tolerance'' indicates the number of support vectors with dual variable . ( $\nu =0.03$ )
$\begin{table} \begin{centering} \epsfxsize =5.7in\leavevmode\epsfbox {adult.epsi}\end{centering}\vspace{8mm} \end{table}$

**Table 3:** Performance of LSVM on NDC generated dataset ( $\nu =0.1$ ). SVM $^{light}$ failed on this dataset.
$\begin{table} \begin{centering} \epsfxsize =5.7in\leavevmode\epsfbox {massive.epsi}\end{centering}\end{table}$

We next compare LSVM with SVM $^{light}$ on the Adult dataset [22], which is commonly used to compare SVM algorithms [24,17]. These results, shown in Table 2, demonstrate that for the largest training sets LSVM performs faster than SVM $^{light}$ with similar test set accuracies. Note that the ``iterations'' column takes on very different meanings for the two algorithms. SVM $^{light}$ defines an iteration as solving an optimization problem over a a small number, or ``chunk,'' of constraints. LSVM, on the other hand, defines an iteration as a matrix calculation which updates all the dual variables simultaneously. These two numbers are not directly comparable, and are included here only for purposes of monitoring scalability. In this experiment, LSVM uses significantly more memory than SVM $^{light}$ .

Table 3 shows results from running LSVM on a massively sized dataset. This dataset was created using our own NDC Data Generator [23] as suggested by Usama Fayyad. The results show that LSVM can be used to solve massive problems quickly, which is particularly intriguing given the simplicity of the LSVM Algorithm. Note that for these experiments, all the data was brought into memory. As such, the running time reported consists of the time used to actually solve the problem to termination excluding I/O time. This is consistent with the measurement techniques used by other popular approaches [11,24]. Putting all the data in memory is simpler to code and results in faster running times. However, it is not a fundamental requirement of our algorithm -- block matrix multiplications, incremental evaluations of $Q^{-1}$ using another application of the Sherman-Morrison-Woodbury identity, and indices on the dataset can be used to create an efficient disk based version of LSVM. We note that we tried to run SVM $^{light}$ on this dataset, but after running for more than six hours it did not terminate.

We now demonstrate the effectiveness of LSVM in solving nonlinear classification problems through the use of kernel functions. One highly nonlinearly separable but simple example is the ``tried and true'' checkerboard dataset [8], which has often been used to show the effectiveness of nonlinear kernel methods on a dataset for which a linear separation clearly fails. The checkerboard dataset contains 1000 points randomly sampled from a checkerboard. These points are used as a training set in LSVM to try to reproduce an accurate rendering of a checkerboard. For this problem, we used the following Gaussian kernel:

$\begin{displaymath} K(G,G') = \varepsilon^{-2 \cdot 10^{-4}\Vert G_i-G_j\Vert^2}, i,j=1,\ldots,m \end{displaymath}$

(31)

Our results, shown in Figure 2, demonstrate that LSVM shows similar or better generalization capability on this dataset when compared to other methods [12,19,14]. Total time for the checkerboard problem using LSVM with a Gaussian kernel was 2.85 hours on the Locop2 Pentium II Xeon machine on the 1000-point training set in

after 100,000 iterations. Test set accuracy was 97.0% on a 39,000-point test set. However, within 58 seconds and after 100 iterations, a quite reasonable checkerboard was obtained (http://www.cs.wisc.edu/~olvi/lsvm/check100iter.eps) with a 95.9% accuracy on the same test set.

**Figure 2:** Gaussian kernel LSVM performance on checkerboard training dataset. ( $\nu =10^5$ )
$\begin{figure} \begin{centering} \epsfxsize =3.5in\leavevmode\epsfbox {check.eps}\end{centering}\end{figure}$

The final set of experiments, shown in Table 4, demonstrates the effectiveness of nonlinear kernel functions with LSVM on UCI datasets [22] specialized as follows for purposes of our tests:

The liver-disorders dataset contains 345 points, each consisting of six features. Class 1 contains 145 points, and class -1 contains 200 points.
The letter-recognition dataset is used for recognizing letters of the alphabet. Traditionally, this dataset is used for multi-category classification. We used it here in a two class situation by taking a subset of those points which correspond to the letter ``A,'' and a subset of those points which correspond to the letter ``B.'' This resulted in a dataset of 600 points with 6 features, where each class contained 300 points.
The mushroom dataset is a two class dataset which contains eight categorical attributes. We transformed each categorical attribute into a series of binary attributes, one attribute for each distinct value. For example, this dataset contains an attribute called ``cap surface,'' which can take one one of four categories, namely ``fibrous,'' ``grooves,'' ``scaly,'' or ``smooth.'' We represented this as four binary attributes. A 1 is assigned to the attribute that corresponds to the actual category, and a 0 to the rest. Thus the categorical value ``fibrous'' would be represented by $[1\ 0\ 0\ 0]$ , while the value ``smooth'' would be represented by $[0\ 0\ 0\ 1]$ . A subset of the entire mushroom dataset was used, so as to shorten the running times of our experiments. The final dataset we used contained 22 features with 200 points in class 1 and 300 points in class -1.
The tic-tac-toe dataset is a two class dataset that contains incomplete tic-tac-toe games. All those games with a possible winning move for ``X'' end up in category 1, and all other games end up in category -1. We have intentionally represented this problem with a poor representation scheme to show the power of non-linear kernels in overcoming this difficulty. Each tic-tac-toe game is represented by 9 attributes, where each attribute corresponds to a spot on the tic-tac-toe board. An ``X'' is represented by 1, an ``O'' is represented by 0, and a blank is represented by -1. We used a subset of this dataset with 9 attributes, 200 points in class 1, and 100 points in class -1.

We chose which kernels to use and appropriate values of $\nu$ via tuning sets and tenfold cross-validation techniques. The results show that nonlinear kernels produce test set accuracies that improve on those obtained with linear kernels.

**Table 4:** LSVM performance with linear, quadratic, and cubic kernels (polynomial kernels of degree 1, 2, and 3 respectively). An asterisk in the first three columns indicates that the LSVM results are significantly different from SVM $^{light}$ at significance level $\alpha =0.05$ . The column ``w/ tolerance'' indicates the number of support vectors with dual variable .
$\begin{table} \begin{centering} \epsfxsize =5.7in\leavevmode\epsfbox {nonlinear.epsi}\end{centering}\end{table}$

Next: 6. Conclusion Up: Lagrangian Support Vector Machines Previous: 4. LSVM for Nonlinear

Journal of Machine Learning Research