Title: A Hybrid ASR System Using Support Vector Machines
Authors: Aravind Ganapathiraju, Jonathan Hamaker, Joseph Picone

Support vector machines (SVMs) are a relatively new class of machine learning techniques that learn classifiers discriminatively. The paradigm has gained significance in the past few years with the development of efficient training algorithms. SVMs are based on the observation that data can be transformed into a very high-dimensional feature space in which the classes can be separated by a simple hyperplane. Though this task seems daunting, especially when the dimension of the feature space runs to a few thousand, the theory of kernels provides an elegant solution and makes the computation feasible even for large tasks.

Like neural networks, SVMs are inherently static classifiers. The dynamic nature of speech must therefore be handled by a hybrid method built on a dynamic model such as the hidden Markov model (HMM). In this research we have developed a hybrid HMM/SVM system for continuous speech recognition, much like the hybrid connectionist systems. The motivation is that HMMs provide an elegant mechanism for handling the dynamics of speech, while SVMs provide powerful classifiers for static data. The SVMs generate posterior probabilities, which are converted to scaled likelihoods before being processed by the standard dynamic programming search used in HMMs.

The mapping from SVM distances to posterior probabilities can be achieved in several ways. We have explored fitting Gaussians to each class-conditional distribution before computing the posterior; another approach is to estimate a sigmoid that performs the mapping directly. We have integrated the SVMs into the search engine that ships with the publicly available ISIP ASR Toolkit. A novel approach we are pursuing uses the sufficient statistics (Fisher scores) generated by the HMMs for classification.
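The sigmoid mapping from SVM distances to posterior estimates can be sketched as follows. This is an illustrative, Platt-style sketch under simplifying assumptions (1-D toy data, plain gradient descent); the function names and the fitting schedule are not taken from the system described above.

```python
import math

def svm_posterior(f, A, B):
    # Map an SVM distance f(x) to a posterior estimate with a sigmoid:
    # p(class | x) = 1 / (1 + exp(A * f(x) + B)).
    return 1.0 / (1.0 + math.exp(A * f + B))

def fit_sigmoid(distances, labels, lr=0.05, steps=2000):
    # Estimate A and B by gradient descent on the cross-entropy between
    # the mapped posteriors and the 0/1 class labels of a held-out set.
    A, B = -1.0, 0.0  # negative A: larger distances -> larger posteriors
    for _ in range(steps):
        for f, y in zip(distances, labels):
            p = svm_posterior(f, A, B)
            # Gradient of the cross-entropy w.r.t. (A*f + B) is (y - p),
            # so the descent update moves A and B by lr * (p - y).
            A += lr * (p - y) * f
            B += lr * (p - y)
    return A, B

# Toy held-out set: positive distances for in-class, negative otherwise.
dists = [2.0, 1.5, 1.0, -1.0, -1.5, -2.0]
labels = [1, 1, 1, 0, 0, 0]
A, B = fit_sigmoid(dists, labels)
```

Dividing the resulting posterior by the class prior then yields the scaled likelihood used in the dynamic programming search.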
This technique is motivated by the fact that Fisher scores encode the evolution, or dynamics, of the observation sequence, whereas the other kernels encode only the static characteristics of a particular model. While the choice of kernel is usually empirical, Fisher kernels have the advantage of being based directly on the sufficient statistics of the model; for HMMs, these statistics are the gradients of the likelihood with respect to each of the model parameters.

Since dimensionality is not a problem when optimizing SVMs, one could classify on multiple frames of data, as connectionist systems do. For large vocabulary tasks, however, training SVMs on frame-level data could mean several hundred thousand training examples per classifier. To avoid such large training sets, segment-level data can be used instead; the latter idea has been pursued in preliminary experiments.

Another interesting outcome of this work is the ability of SVMs to identify mislabeled training data. This is an important feature because it gives us a principled way of handling inaccurate training data, especially since the training data for the SVM classifiers is generated by an HMM system that can introduce errors.

The system has been evaluated on the OGI Alphadigits corpus and performs at 8.9% WER, compared to 11.3% for an 8-mixture cross-word triphone HMM system and 9.8% for an 8-mixture syllable HMM system. We are presently porting the technology to conversational speech (SWITCHBOARD). In the next few months we will have results based on the Fisher kernel method and will have benchmarked the use of multi-frame data.
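The gradient statistics behind the Fisher kernel can be illustrated on the simplest possible model. The sketch below computes the score for a single 1-D Gaussian rather than a full HMM (whose score would stack gradients over all transition and emission parameters); the function name and toy values are assumptions for illustration only.

```python
def fisher_score(frames, mu, var):
    # Fisher score for a 1-D Gaussian: the gradient of the sequence
    # log-likelihood with respect to the model parameters (mu, var).
    #   d/d_mu  log N(x; mu, var) = (x - mu) / var
    #   d/d_var log N(x; mu, var) = ((x - mu)**2 - var) / (2 * var**2)
    g_mu = sum((x - mu) / var for x in frames)
    g_var = sum(((x - mu) ** 2 - var) / (2 * var ** 2) for x in frames)
    return [g_mu, g_var]
```

The key property is that the score has one component per model parameter, so observation sequences of any length map to fixed-length vectors that a kernel-based classifier such as an SVM can consume directly.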