The support vector machine (SVM) algorithm is probably the most widely
classification method. It is basically a form of linear classifier. We
are given the set of points
where each
is the class label for the
-th point
. SVM learns a linear
discriminant function
for
where
is a mapping from the input space to a feature space
. Here the learning implies computing
and
, the linear hyperplane that best separates the training
examples with the label -1 and the ones with the label 1. The mapping
could be chosen explicitly, or implicitly induced by a
kernel function. More details will be available in this section.
The training set is said to be linearly separable when there exists a linear discriminant function whose sign matches the class of all training examples:
A linearly separable data usually allows an infinite number of separating hyperplanes. Therefore, the hyperplane that maximizes the margin is chosen to predict better on unseen examples. This optimum hyperplane is shown in Figure 1.
Figure 1- The optimal hyperplane separates positive and negative examples with the maximal margin. The position of the hyperplane is solely determined by the few examples that are closest to it (the support vectors).
The following optimization problem expresses this choice:
Directly solving this problem is difficult because the constraints are quite complex. The mathematical tool of choice for simplifying this problem is the Lagrangian duality theory. This approach leads to solving the following dual problem:
The linear discriminant function can then be written as:
The dual optimization problem and the linear discriminant function
only involve the patterns x through the computation of dot products in
feature space. There is no need to compute the features
when one knows how to compute the dot products directly and the
resulting optimization problem is convex.
Instead of hand-choosing a feature function
, it has
been proposed to directly choose a kernel function
that represents a dot product
.
induces a mapping from the
original input space to a possibly high-dimensional feature
space. Although the corresponding feature space could be
infinite-dimensional, all computations can be performed without ever
computing a feature vector
. Complex nonlinear
classifiers are computed using the linear mathematics of the optimal
hyperplanes and this is referred to as the kernel trick.
Soft margins can be used in case the problem is noisy. The discussion for which is beyond the scope of this document. Please refer to [steinwart2008support], [platt1999fast] for more information.
| [steinwart2008support] | “Support vector machines”, Steinwart, I. and Christmann, A. Springer Verlag 2008 |
| [platt1999fast] | “Fast training of support vector machines using sequential minimal optimization” Platt, J, C. MIT Press 1999 |
The number of support vectors finally produced by the optimization is the critical component of the computational cost of solving the dual problem. Since the asymptotic number of support vectors grows linearly with the number of examples, the computational cost of solving the SVM problem has both a quadratic and a cubic component. It grows at least like
when parameter
is small and
when
is large. Empirical evidence shows that modern SVM solvers come close to these scaling laws.
, where d is the distance bewteen 2 observations, is an expensive operation taking many processor cycles.
. For even medium sized datasets storing this in memory becomes prohibitive.The most successful methods available today fall into the category of decomposition methods. They address the full-scale dual problem by solving a sequence of smaller quadratic programming sub-problems. The method implemented in 1305 is one of the best methods available today called Sequential Minimal Optimization (SMO) proposed by PLatt. The core of this method is that it uses the smallest possible working set size, that is two elements and solves the corresponding optimization problem analytically. This choice dramatically simplifies the decomposition method. The asymptotic convergence and finite termination properties of this particular case of the decomposition method are very well understood.
There are various techniques that can be used to speed up SMO. Some are optimization algorithm independent while others are specific to SMO. These have been implemented in 1305 and are described below.
Support Vector Machines are becoming more and more successful in business applications,[min2005bankruptcy]_, [van-bayesian].
- Two-class SVM with different costs per each class.
- Crossvalidation.
| [min2005bankruptcy] | “Bankruptcy prediction using support vector machine with optimal choice of kernel function parameters”, Min, J.H. and Lee, Y.C. in Expert Systems with Applicationsm volume 28, 2005 |
| [van-bayesian] | “Bayesian Kernel Based Classification for Financial Distress Detection”, Van Gestel, etall, in European Journal of Operational Research, vol 172, 2006. |
--help Print this information.
--iterations (=-1) REQUIRED file containing data to be clustered.
--point (=dense) Point type used by SMO. One of: dense, sparse, dense_sparse, categorical, dense_categorical
--results_out the file to store the results
--support_vectors_out the file to store the generated model’s support vectors
--alphas_out the file to store the generated model’s alphas
--parameters_out the file to store the generated model’s parameters viz. C and bias in that order
--support_vectors_in the file containing the support vectors
--alphas_in the file containing the alpha coefficients
--parameters_in the file containing the parameters viz. C and bias in that order
--task train/test/train-test
--c (=-1) svm parameter c
--cost_factor (=1) Ratio by which error on positive cases outweights negative
--sigma sigma for gaussian
--queries_in file containing the test points
--references_in file containing the training points
--wss (=2) working set selection (1 or 2)
--max_cache_memory (=100) maximum memory, in MB, used for cache
--metric (=l2) Metric function used by SMO. One of: l2, weighted_l2
--metric_weights_in A file containing weights for use with --metric=weighted_l2
--dense_sparse_scale (=1) The scaling factor for the sparse part of --point=dense_sparse or --point=dense_categorical for use with --metric=weighted_l2
--log A file to receive the log, or omit for stdout.
--loglevel (=debug) Level of log detail. One of:
- debug: log everything
- verbose: log messages and warnings
- warning: log only warnings
- silent: no logging
Assuming you are connected to a cloud machine (see Running 1305 on the Amazon EC2 cloud), the smo algorithm can be run from your home directory. Run the following command on the console
smo_cached --help
This will display all the options available to you and test that you are being able to run the executable. The file format expected is the 1305 file format. Any CSV can easily be converted to the 1305 using a single command and the conversion utils we have provided. You can find these in section: 1305 File Format.
We have provided a sample file for classification. It is called sampleWithLabels50k.txt and contains about 48000 points along with labels in the 1305 format. You will find the samples in your home directory in a folder called samples. To run the classifier on this file you may use the following command
smo_cached --task=train-test --support_vectors_out=svs.out \ --alphas_out=alphas.out \ --parameters_out=parameters.out \ --sigma=1 \ --references_in=samples/sampleWithLabels50k.txt \ --queries_in=samples/sampleWithLabels50k.txt \ --results_out=results.out
The arguments are as follows:
- “support_vectors_out” is the file to which support vectors are written.
- “alphas_out” is the file to which the coefficients of each of the support vectors in the file svs.out will be written.
- “parameters_out” is the file to which the parameters such as C, bias, bandwidth etc. are written.
These 3 comprise the model and can be used in the future for testing. 3. “sigma” is the bandwidth for the Gaussian Kernel. It is the only kernel presently supported. 4. “references_in” represents the training data if in “train” or “train-test” mode. 5. “queries_in” represents the testing data if in either “train-test” or “test” mode. 6. “results_out” is the file to which classification results on queries_in is written to when in “test” or “train-test” mode.
In case your data is too big and you need to train an SVM, you can use the scalable version of svm (Hadoop based Bootstrapped SVM) that uses a cluster of machines.
We would like to know your opinion about our product. Please send us bug reports and suggestions at support@analytics1305.com