Title: A Hybrid ASR System Using Support Vector Machines
Authors: Aravind Ganapathiraju, Jonathan Hamaker, Joseph Picone

Support vector machines (SVMs) are a relatively new class of machine learning techniques that learn classifiers discriminatively. The paradigm has gained significance in the past few years with the development of efficient training algorithms. SVMs are based on the observation that data can be transformed into a very high-dimensional feature space in which the classes can be separated by a simple hyperplane. Though this task seems daunting, especially when the dimension of the feature space runs to a few thousand, the theory of kernels provides an elegant solution and makes the computation feasible even for large tasks.

Like neural networks, SVMs are inherently static classifiers. The dynamic nature of speech must therefore be handled by a hybrid method built on a dynamic model such as the hidden Markov model (HMM). In this research we have developed a hybrid HMM/SVM system for continuous speech recognition, much like the hybrid connectionist systems. The motivation is that HMMs provide an elegant mechanism for handling the dynamics of speech, while SVMs provide powerful classifiers for static data. The SVMs generate posterior probabilities, which are converted to scaled likelihoods before being processed by the standard dynamic programming search used in HMMs.

The mapping from SVM distances to posterior probabilities can be achieved in several ways. We have explored fitting Gaussians to each class-conditional distribution before computing the posterior; another approach is to estimate a sigmoid that performs the mapping directly. We have integrated the SVMs into the search engine that ships with the publicly available ISIP ASR Toolkit. A novel approach we are pursuing uses the sufficient statistics (Fisher scores) generated by the HMMs for classification.
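The sigmoid mapping from SVM distances to posterior estimates can be sketched as follows. This is an illustrative, Platt-style sketch under simplifying assumptions (1-D toy data, plain gradient descent); the function names and the fitting schedule are not taken from the system described above.

```python
import math

def svm_posterior(f, A, B):
    # Map an SVM distance f(x) to a posterior estimate with a sigmoid:
    # p(class | x) = 1 / (1 + exp(A * f(x) + B)).
    return 1.0 / (1.0 + math.exp(A * f + B))

def fit_sigmoid(distances, labels, lr=0.05, steps=2000):
    # Estimate A and B by gradient descent on the cross-entropy between
    # the mapped posteriors and the 0/1 class labels of a held-out set.
    A, B = -1.0, 0.0  # negative A: larger distances -> larger posteriors
    for _ in range(steps):
        for f, y in zip(distances, labels):
            p = svm_posterior(f, A, B)
            # Gradient of the cross-entropy w.r.t. (A*f + B) is (y - p),
            # so the descent update moves A and B by lr * (p - y).
            A += lr * (p - y) * f
            B += lr * (p - y)
    return A, B

# Toy held-out set: positive distances for in-class, negative otherwise.
dists = [2.0, 1.5, 1.0, -1.0, -1.5, -2.0]
labels = [1, 1, 1, 0, 0, 0]
A, B = fit_sigmoid(dists, labels)
```

Dividing the resulting posterior by the class prior then yields the scaled likelihood used in the dynamic programming search.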
This technique is motivated by the fact that Fisher scores encode the evolution, or dynamics, of the observation sequence, whereas the other kernels encode only the static characteristics of a particular model. While the choice of kernel is usually empirical, Fisher kernels have the advantage of being based directly on the sufficient statistics of the model; for HMMs, these statistics are the gradients of the likelihood with respect to each of the model parameters.

Since dimensionality is not a problem when optimizing SVMs, one could classify on multiple frames of data, as connectionist systems do. For large vocabulary tasks, however, training SVMs on frame-level data could mean several hundred thousand training examples per classifier. To avoid such large training sets, segment-level data can be used instead; the latter idea has been pursued in preliminary experiments.

Another interesting outcome of this work is the ability of SVMs to identify mislabeled training data. This is an important feature because it gives us a principled way of handling inaccurate training data, especially since the training data for the SVM classifiers is generated by an HMM system that can introduce errors.

The system has been evaluated on the OGI Alphadigits corpus and performs at 8.9% WER, compared to 11.3% for an 8-mixture cross-word triphone HMM system and 9.8% for an 8-mixture syllable HMM system. We are presently porting the technology to conversational speech (SWITCHBOARD). In the next few months we will have results based on the Fisher kernel method and will have benchmarked the use of multi-frame data.
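The gradient statistics behind the Fisher kernel can be illustrated on the simplest possible model. The sketch below computes the score for a single 1-D Gaussian rather than a full HMM (whose score would stack gradients over all transition and emission parameters); the function name and toy values are assumptions for illustration only.

```python
def fisher_score(frames, mu, var):
    # Fisher score for a 1-D Gaussian: the gradient of the sequence
    # log-likelihood with respect to the model parameters (mu, var).
    #   d/d_mu  log N(x; mu, var) = (x - mu) / var
    #   d/d_var log N(x; mu, var) = ((x - mu)**2 - var) / (2 * var**2)
    g_mu = sum((x - mu) / var for x in frames)
    g_var = sum(((x - mu) ** 2 - var) / (2 * var ** 2) for x in frames)
    return [g_mu, g_var]
```

The key property is that the score has one component per model parameter, so observation sequences of any length map to fixed-length vectors that a kernel-based classifier such as an SVM can consume directly.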