Kernel discriminant analysis (KDA) is a type of classifier that predicts the label of a given query point. It is also a non-parametric Bayes classifier in which each class is modeled using a non-parametric density estimator, such as the kernel density estimator. KDA can be computed with the help of KDE (see also Kernel Density Estimation). It is particularly useful in scientific applications due to its balance of properties:
- Like all other non-parametric estimators, it gives highly accurate classification results.
- The estimator provides an intuitive method for incorporating prior knowledge by specifying the prior probability (or marginal probability) of each class label.
The inputs and outputs of the kernel discriminant analysis are as follows:
- Input: The set of reference points per each class label
,
, the set of priors
for each clas, and the set of query points for which the labels have to be predicted.
- Output: The labels for each query point.
The optimal classifier on
classes assigns observation
to the class
,
,
which has the greatest posterior probability
; the
posterior probability is the conditional probability of the label of
the query point
being
after
has been
observed. Applying Bayes’ rule yields the following:
We assign the label
to the query poitn
where
where
denotes the probability density of data sampled
from
and
is the the prior probability of
any arbitrary point having the class label
. Note that for
a fixed query point
, the denominator of the second equality
is constant and was subsequently dropped in the last equality.
In place of typically known values of
, KDA uses
kernel density estimates of the form (see also Kernel Density Estimation):
Trained with bandwidth
on a size
of the reference
observations,
belongs to
of class
. The kernel
may have many forms but must be
non-negative and have:
Note: Commonly used kernels are the Gaussian Kernel and the Epanechnikov Kernel.
Proper selection of bandwidths
,
,
is critical for accurate classification results. It is typically most
effective to choose bandwidths that minimize the following zero-one
loss function on the reference set.
Where
is the label provided for the
reference point
and
is
the predicted label computed by excluding the contribution of
itself, replacing
with:
and
. A simple cross-validation method is a
grid-based method that computes
for a set of grid
values.
Although non-parametric methods have always been recognized for the advantage of their accuracy when compared to parametric methods, their computational costs have prevented their adoption. References can be found in: [bult1993semiparametric], [baesens2003benchmarking], [ridgeway1998interpretable] and [thomas2000survey].
The obvious method of performing KDA is to compute the full kernel summation for each query.
For
queries and
references this naive algorithm is
.
The algorithm implementation is trivial but does not scale to even medium-sized datasets.
The Analytics1305 implementation of KDA uses multidimensional spatial trees such as ball-trees and kd-trees to efficiently prune points in the reference set which are too far away from the query at hand to have any (or any significant) effect. For example the epanechnikov kernel which forms a bell shaped density region around any reference point will not have any effect on query points that lie outside this bell shaped region. This fact can be utilized with efficient branch and bound techniques facilitated by multidimensional trees.
These and many other pruning techniques drastically reduce the amortized amount of work done per query point in the classification process. For more information, please refer to [riegel-massive].
| [riegel-massive] | “Massive-Scale Kernel Discriminant Analysis: Mining for Quasars”, Riegel, R. and Gray, A. and Richards, G. SIAM Data Mining 2008 |
| [bult1993semiparametric] | “Semiparametric versus parametric classification models: An application to direct marketing”, Bult, J.R, Journal of Marketing Research, volume 30, 1993. |
| [baesens2003benchmarking] | “Benchmarking state-of-the-art classification algorithms for credit scoring”, Baesens, B, etall, Journal of the Operational Research Society, vol 54, 2003. |
| [ridgeway1998interpretable] | “Interpretable boosted naive Bayes classification”, Ridgeway, G. etall, in Proceedings of the fourth international conference on knowledge discovery and data mining, 1998, |
| [thomas2000survey] | “A survey of credit and behavioural scoring: forecasting financial risk of lending to consumers”, Thomas L.C, in International Journal of Forecasting, vol 16, 2000. |
Available options:
--help Print this information.
--queries_in File(s) containing classification queries. List none to run LOO on class references.
--results_out File(s) to store query file results. List one per query file or one per class if no queries.
--references_in File(s) containing one class’ references. List at least two.
--prior Prior(s) for class inclusion. List none to estimate priors from class sizes or one fewer than classes; last class gets 1 - total.
--bandwidth Kernel bandwidth(s). List one per class or one total to be used by all classes.
--epanechnikov Use the Epanechnikov kernel to estimate densities (default).
--gaussian Use the Gaussian kernel to estimate densities.
--dualtree Use the dualtree method of computation (default).
--singletree Use the singletree method of computation.
--exhaustive Use the exhaustive method of computation.
Subscribe to our newsletter to get an update on the release date