presented by

Jon Hamaker
Institute for Signal and Information Processing
Mississippi State University, Mississippi State, MS 39762
email: hamaker@cavs.msstate.edu

The lack of freely available state-of-the-art Speech-to-Text (STT) software has been a major hindrance to the development of new audio information processing technology. The high cost of the infrastructure required to conduct state-of-the-art speech recognition research prevents many small research groups from evaluating new ideas on large-scale tasks. From its inception, the Institute for Signal and Information Processing (ISIP) has been dedicated to providing the research community with public-domain software tools for digital information processing via the Internet to facilitate worldwide synergistic development of speech recognition technology.

In this talk, we present the core components of our state-of-the-art Speech-to-Text system: an acoustic processor, which converts the speech signal into a sequence of feature vectors; a training module, which estimates the parameters of a Hidden Markov Model; a linguistic processor, which predicts the next word given a sequence of previously recognized words; and a search engine, which finds the most probable word sequence given a set of feature vectors. By far the most important component of a Speech-to-Text system is the search engine, or decoder. The ISIP decoder was designed to be modular and extensible so that it can handle a wide variety of speech recognition problems (connected digits, studio-quality read speech, and spontaneous telephone conversations) in a transparent fashion. Moving from a well-defined task to a less rigorously defined recognition problem (spontaneous speech recognition, e.g., Switchboard) requires the decoder to have a sophisticated control structure. Hence, very few good decoders exist, and the best decoders are always considered proprietary.
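As an illustration of what a linguistic processor computes, the following is a minimal sketch of a bigram language model with add-one smoothing. It is not the ISIP implementation; the function names `train_bigram` and `next_word_probs`, the sentence-boundary tokens, and the choice of smoothing are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count word pairs in tokenized sentences (hypothetical helper).

    Returns (counts, vocab), where counts[prev][word] is the number of
    times `word` followed `prev`, and vocab is every word that can be
    predicted, including the end-of-sentence marker "</s>".
    """
    counts = defaultdict(Counter)
    vocab = set()
    for words in sentences:
        tokens = ["<s>"] + list(words) + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
            vocab.add(nxt)
    return counts, vocab


def next_word_probs(counts, vocab, prev):
    """Add-one smoothed estimate of P(word | prev) for every word in vocab."""
    seen = counts.get(prev, Counter())
    total = sum(seen.values()) + len(vocab)
    return {word: (seen[word] + 1) / total for word in vocab}


# Toy usage: predict the word following "call".
counts, vocab = train_bigram([["call", "home"], ["call", "mom"], ["call", "home"]])
probs = next_word_probs(counts, vocab, "call")
best_guess = max(probs, key=probs.get)
```

A decoder would combine such a probability with the acoustic score of each hypothesis; longer histories (trigrams and beyond) follow the same pattern with a longer context key.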

The ISIP decoder can compile network grammars, efficiently decode n-gram language models, generate and rescore lattices, generate N-best lists, and perform forced alignments. The decoder is based on a hierarchical Viterbi, breadth-first search tree that supports cross-word context-dependent acoustic models. To maintain an efficient search space, the decoder uses lexical trees to represent the pronunciations of all words. It also incorporates beam pruning at the state, phone, and word levels, and it limits the number of active model instances per frame to prevent the evaluation of low-scoring hypotheses. A benchmark evaluation (which does not include such technology as MLLR, VTLN, or PLP) conducted on a subset of the Switchboard corpus yielded a word error rate (WER) of 41.8%. This is competitive with commercially available Speech-to-Text systems.
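The pruning strategy described above can be sketched with a flat, state-level Viterbi search. This is a simplified illustration under stated assumptions, not the ISIP decoder: it omits the hierarchy, lexical trees, and cross-word models, and the function name `viterbi_beam` and the parameters `beam` and `max_active` are hypothetical. Scores are log probabilities; `beam` drops hypotheses that fall too far below the best score in a frame, and `max_active` caps how many hypotheses survive each frame.

```python
import math


def viterbi_beam(log_init, log_trans, log_emit, beam=10.0, max_active=None):
    """Frame-synchronous Viterbi search with beam pruning (illustrative).

    log_init[s]     : log probability of starting in state s
    log_trans[p][s] : log probability of moving from state p to state s
    log_emit[t][s]  : log likelihood of frame t under state s
    Returns (best state path, log score of that path).
    """
    n_states = len(log_init)

    def prune(scores):
        # Beam pruning: keep hypotheses within `beam` of the frame's best.
        best = max(scores.values())
        kept = {s: v for s, v in scores.items() if v >= best - beam}
        # Optional cap on the number of active hypotheses per frame.
        if max_active is not None and len(kept) > max_active:
            top = sorted(kept, key=kept.get, reverse=True)[:max_active]
            kept = {s: kept[s] for s in top}
        return kept

    active = prune({s: log_init[s] + log_emit[0][s] for s in range(n_states)})
    backpointers = []
    for t in range(1, len(log_emit)):
        scores, back = {}, {}
        # Only surviving hypotheses are extended (breadth-first, per frame).
        for p, p_score in active.items():
            for s in range(n_states):
                cand = p_score + log_trans[p][s] + log_emit[t][s]
                if cand > scores.get(s, -math.inf):
                    scores[s] = cand
                    back[s] = p
        active = prune(scores)
        backpointers.append(back)

    # Trace back from the best surviving final state.
    state = max(active, key=active.get)
    best_score = active[state]
    path = [state]
    for back in reversed(backpointers):
        state = back[state]
        path.append(state)
    path.reverse()
    return path, best_score
```

A narrow beam reduces computation but can discard the hypothesis that would eventually have won, so beam widths are normally tuned against recognition accuracy on held-out data.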

For more information about the ISIP ASR system, including all source code and a tutorial, please visit:
http://www.cavs.msstate.edu/research/isip/projects/speech/software/asr/index.html