Support Vector Machines

Introduction

The support vector machine (SVM) algorithm is probably the most widely classification method. It is basically a form of linear classifier. We are given the set of points R = \{ (x_1; y_1) \cdots (x_n; y_n)
\} where each y_i \in \{ -1, 1\} is the class label for the i-th point x_i \in R^D. SVM learns a linear discriminant function \hat{y}(x) = sgn \left( w^T \phi(x) + b
\right ) for R where \phi: R^D \rightarrow
\mathcal{H} is a mapping from the input space to a feature space \mathcal{H}. Here the learning implies computing w and b, the linear hyperplane that best separates the training examples with the label -1 and the ones with the label 1. The mapping \phi could be chosen explicitly, or implicitly induced by a kernel function. More details will be available in this section.

The training set is said to be linearly separable when there exists a linear discriminant function whose sign matches the class of all training examples:

\hat{y}(x_i) = y_i, \forall 1 \leq i \leq n

A linearly separable data usually allows an infinite number of separating hyperplanes. Therefore, the hyperplane that maximizes the margin is chosen to predict better on unseen examples. This optimum hyperplane is shown in Figure 1.

_images/svm_hyperplane.png

Figure 1- The optimal hyperplane separates positive and negative examples with the maximal margin. The position of the hyperplane is solely determined by the few examples that are closest to it (the support vectors).

The following optimization problem expresses this choice:

min \:\: \rho (w, b) \: = \: \frac{1}{2}w^{2}

subject\:\:to\:\:\forall i \:\:y_{i}\:(w^{T} \phi  (x_{i}) + b)\: \geq \:1

Directly solving this problem is difficult because the constraints are quite complex. The mathematical tool of choice for simplifying this problem is the Lagrangian duality theory. This approach leads to solving the following dual problem:

max \:\: \varrho(\alpha)\: = \: \sum_{i=1}^{n} \: \alpha_i \:\: - \:\: \frac{1}{2} \: \sum_{i,j=1}^{n} y_i \alpha_i \: y_j \alpha_j \: \phi(x_i)^T \: \phi(x_j)

subject \:\: to \: \: \: \displaystyle \Bigg\{ \substack{  \forall i \:\:\: \alpha_i \: \geq \: 0, \\ \sum_i\:y_i\alpha_i = \: 0 }

The linear discriminant function can then be written as:

\hat{y}(x) \: = \: w_{*}^{T} \: x + b^* \: = \sum_{i=1}^n \: y_i \alpha_i \phi(x_i)^T \phi(x) + b^*

The dual optimization problem and the linear discriminant function only involve the patterns x through the computation of dot products in feature space. There is no need to compute the features phi(x) when one knows how to compute the dot products directly and the resulting optimization problem is convex.

Instead of hand-choosing a feature function \phi(x) , it has been proposed to directly choose a kernel function K(x;
x\prime) that represents a dot product \phi(x)^T
\phi(x\prime). K(x; x\prime) induces a mapping from the original input space to a possibly high-dimensional feature space. Although the corresponding feature space could be infinite-dimensional, all computations can be performed without ever computing a feature vector \phi(x). Complex nonlinear classifiers are computed using the linear mathematics of the optimal hyperplanes and this is referred to as the kernel trick.

Soft margins can be used in case the problem is noisy. The discussion for which is beyond the scope of this document. Please refer to [steinwart2008support], [platt1999fast] for more information.

[steinwart2008support]“Support vector machines”, Steinwart, I. and Christmann, A. Springer Verlag 2008
[platt1999fast]“Fast training of support vector machines using sequential minimal optimization” Platt, J, C. MIT Press 1999

Computational Complexity

The number of support vectors finally produced by the optimization is the critical component of the computational cost of solving the dual problem. Since the asymptotic number of support vectors grows linearly with the number of examples, the computational cost of solving the SVM problem has both a quadratic and a cubic component. It grows at least like n^2 when parameter C is small and n^3 when C is large. Empirical evidence shows that modern SVM solvers come close to these scaling laws.

Computing Kernels is Expensive

  1. Kernels are expensive to compute. For example Gaussian kernel is e^{-\gamma(d^2)} , where d is the distance bewteen 2 observations, is an expensive operation taking many processor cycles.
  2. The size of the kernel matrix is n^2. For even medium sized datasets storing this in memory becomes prohibitive.

SVM Optimization Techniques - Sequential Minimal Optimization

The most successful methods available today fall into the category of decomposition methods. They address the full-scale dual problem by solving a sequence of smaller quadratic programming sub-problems. The method implemented in 1305 is one of the best methods available today called Sequential Minimal Optimization (SMO) proposed by PLatt. The core of this method is that it uses the smallest possible working set size, that is two elements and solves the corresponding optimization problem analytically. This choice dramatically simplifies the decomposition method. The asymptotic convergence and finite termination properties of this particular case of the decomposition method are very well understood.

Speeding up SMO

There are various techniques that can be used to speed up SMO. Some are optimization algorithm independent while others are specific to SMO. These have been implemented in 1305 and are described below.

  1. Working set selection uses second order statistics. This means that more effort is put into selecting which 2 points need to be optimized so as to take bigger steps toward the optimal solution.
  2. Kernel Matrix caching: As mentioned before calculating kernel values is expensive. Kernel values are cached in memory to save re-computation.
  3. Shrinking: The shrinking technique reduces the size of the problem by temporarily eliminating variables that are unlikely to be selected in the SMO working set because they have reached their lower or upper bound. The SMO iterations then continue on the remaining variables. Shrinking reduces the number of kernel values needed to update the gradient vector. The hit rate of the kernel cache is therefore improved.

Choosing the Parameters

Applications of SVMs in business

Support Vector Machines are becoming more and more successful in business applications,[min2005bankruptcy]_, [van-bayesian].

Supported Features

  1. Two-class SVM with different costs per each class.

Planned Extensions

  1. Crossvalidation.
[min2005bankruptcy]“Bankruptcy prediction using support vector machine with optimal choice of kernel function parameters”, Min, J.H. and Lee, Y.C. in Expert Systems with Applicationsm volume 28, 2005
[van-bayesian]“Bayesian Kernel Based Classification for Financial Distress Detection”, Van Gestel, etall, in European Journal of Operational Research, vol 172, 2006.

Running SVM

Available options:

--help Print this information.

--iterations (=-1) REQUIRED file containing data to be clustered.

--tree (=kdtree) Tree structure used by k-means. One of:
kdtree, balltree

--point (=dense) Point type used by SMO. One of: dense, sparse, dense_sparse, categorical, dense_categorical

--results_out the file to store the results

--support_vectors_out the file to store the generated model’s support vectors

--alphas_out the file to store the generated model’s alphas

--parameters_out the file to store the generated model’s parameters viz. C and bias in that order

--support_vectors_in the file containing the support vectors

--alphas_in the file containing the alpha coefficients

--parameters_in the file containing the parameters viz. C and bias in that order

--task train/test/train-test

--c (=-1) svm parameter c

--cost_factor (=1) Ratio by which error on positive cases outweights negative

--sigma sigma for gaussian

--queries_in file containing the test points

--references_in file containing the training points

--wss (=2) working set selection (1 or 2)

--max_cache_memory (=100) maximum memory, in MB, used for cache

--metric (=l2) Metric function used by SMO. One of: l2, weighted_l2

--metric_weights_in A file containing weights for use with --metric=weighted_l2

--dense_sparse_scale (=1) The scaling factor for the sparse part of --point=dense_sparse or --point=dense_categorical for use with --metric=weighted_l2

--log A file to receive the log, or omit for stdout.

--loglevel (=debug) Level of log detail. One of:

  • debug: log everything
  • verbose: log messages and warnings
  • warning: log only warnings
  • silent: no logging

A Simple Example

Assuming you are connected to a cloud machine (see Running 1305 on the Amazon EC2 cloud), the smo algorithm can be run from your home directory. Run the following command on the console

smo_cached --help

This will display all the options available to you and test that you are being able to run the executable. The file format expected is the 1305 file format. Any CSV can easily be converted to the 1305 using a single command and the conversion utils we have provided. You can find these in section: 1305 File Format.

We have provided a sample file for classification. It is called sampleWithLabels50k.txt and contains about 48000 points along with labels in the 1305 format. You will find the samples in your home directory in a folder called samples. To run the classifier on this file you may use the following command

smo_cached --task=train-test
           --support_vectors_out=svs.out    \
           --alphas_out=alphas.out                  \
           --parameters_out=parameters.out  \
           --sigma=1                                \
           --references_in=samples/sampleWithLabels50k.txt  \
           --queries_in=samples/sampleWithLabels50k.txt     \
           --results_out=results.out

The arguments are as follows:

  1. “task” specifies what you want to do. Here we are training and testing at the same time. Other options include “train”, which does not evaluate the SVM just creates a model, and “test” which takes a previously created model and evaluates the test data on it.
  2. Model output files:
  1. “support_vectors_out” is the file to which support vectors are written.
  2. “alphas_out” is the file to which the coefficients of each of the support vectors in the file svs.out will be written.
  3. “parameters_out” is the file to which the parameters such as C, bias, bandwidth etc. are written.

These 3 comprise the model and can be used in the future for testing. 3. “sigma” is the bandwidth for the Gaussian Kernel. It is the only kernel presently supported. 4. “references_in” represents the training data if in “train” or “train-test” mode. 5. “queries_in” represents the testing data if in either “train-test” or “test” mode. 6. “results_out” is the file to which classification results on queries_in is written to when in “test” or “train-test” mode.

Running SVMs on big data

In case your data is too big and you need to train an SVM, you can use the scalable version of svm (Hadoop based Bootstrapped SVM) that uses a cluster of machines.

Suggestions and bug reports

We would like to know your opinion about our product. Please send us bug reports and suggestions at support@analytics1305.com