Title: Support Vector Machines for Automatic Data Cleanup
Authors: Aravind Ganapathiraju, Joseph Picone

Support Vector Machines (SVMs) are a class of machine learning techniques that learn to classify discriminatively. This paradigm has gained prominence in the past few years with the development of efficient training algorithms. The estimation of the parameters of an SVM is posed as a quadratic optimization problem. The result is a set of Lagrange multipliers, one for each training data point. Only a small percentage of the training vectors have a corresponding non-zero multiplier; these data points are called support vectors. Though a solution to this quadratic optimization is guaranteed, the number of required computations can be very high, depending on the separability of the data and the number of training data points. Several heuristics must be considered to make SVM training possible in a reasonable amount of time. One of the most common ways to tackle this problem is to divide the optimization into sub-problems whose solutions can be found easily. "Chunking" is based on this paradigm: the data is divided into chunks, and the functional is optimized for each chunk. It has been proved that this algorithm does indeed give the same solution as a global optimization process, but with much less operating memory and time. The algorithm proceeds as follows:

1) choose a chunk of training points (the working set)
2) solve the optimization problem defined by the points in the working set
3) continue until there is no further change in the functional's value

The trick is to find a working set (chunk) that helps the optimization converge quickly. Each new chunk of data is chosen so that it contains the points that violate the optimality constraints the most. For each chunk of data that is optimized, there will be support vectors with their multipliers at the upper bound, which indicates that the support vector lies in a region where the classes overlap.
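As an illustration of the dual problem described above, the following sketch solves the SVM quadratic program directly on a tiny toy dataset (the chunking procedure solves the same QP restricted to a working set). This is a minimal sketch using SciPy's SLSQP solver, not the algorithm of the paper; the data and variable names are our own:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable 2-D data, labels in {-1, +1}
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
n, C = len(X), 10.0          # C is the upper bound on each multiplier

# Dual Hessian: Q_ij = y_i * y_j * <x_i, x_j>
Q = (y[:, None] * y[None, :]) * (X @ X.T)

def neg_dual(a):
    # Negative of the dual objective: maximize sum(a) - 0.5 a'Qa
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_dual, np.zeros(n), method="SLSQP",
               bounds=[(0.0, C)] * n,
               constraints={"type": "eq", "fun": lambda a: a @ y})
alphas = res.x

# Only a few multipliers are non-zero: those points are the support vectors
support = np.where(alphas > 1e-4)[0]
```

For separable data like this, only the points nearest the opposite class receive non-zero multipliers; the rest of the training set plays no role in defining the classification surface.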
These multipliers very often remain at the upper bound. When this happens over several iterations of the optimization process, it is a good indication that the data point with the bounded multiplier lies in the overlap region. The overlap can suggest one of two things: either there is inherent overlap in the data, or the data point is mislabeled. Either way, it is better not to include this data point in the definition of the classification surface.

This unique property of the SVM optimization process can be used effectively in speech recognition in several ways. One application is data cleanup. Several databases, especially databases of conversational speech, come with an inherent transcription word error rate that significantly degrades training efficacy. The capability to identify mislabeled data in such an environment can be of tremendous help in training acoustic models effectively. Another area where this property can be applied is confidence measures. Several speech recognizers use confidence measures to guide the training and recognition processes. We can devise methods to quantify the degree of mislabeling based on this property and use that as a confidence measure.

Preliminary experiments on highly confusable phone pairs in OGI Alphadigits indicate that SVMs do a very good job of identifying mislabeled data and are very consistent. We are in the process of using this feature of SVMs to identify mislabeled data at the word level, which translates to identifying transcription errors in databases.
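The cleanup idea can be sketched with a standard SVM implementation: train on data containing a deliberately flipped label, then flag the points whose multipliers sit at the upper bound C. This is a minimal sketch assuming scikit-learn's SVC (where `dual_coef_` holds the signed multipliers of the support vectors); it is not the paper's experimental setup:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated Gaussian clusters, labels in {-1, +1}
X = np.vstack([rng.normal(-2.0, 0.3, (40, 2)),
               rng.normal(2.0, 0.3, (40, 2))])
y = np.array([-1] * 40 + [1] * 40)
y[0] = 1                     # deliberately mislabel one point

clf = SVC(kernel="linear", C=1.0).fit(X, y)
alphas = np.abs(clf.dual_coef_).ravel()   # |alpha_i * y_i| per support vector

# Multipliers stuck at the upper bound C mark points in the overlap region;
# in clean, well-separated data these are candidates for mislabeling
at_bound = clf.support_[np.isclose(alphas, clf.C)]
```

The flipped point sits deep inside the wrong class, so its KKT conditions force its multiplier to the bound, and it appears in `at_bound`; flagging such points is the mechanism the abstract proposes for transcription cleanup.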