presented by
Jon Hamaker
Institute for Signal and Information Processing
Mississippi State University, Mississippi State, MS 39762
email: hamaker@cavs.msstate.edu
The lack of freely available state-of-the-art Speech-to-Text (STT)
software has been a major hindrance to the development of new audio
information processing technology. The high cost of the infrastructure
required to conduct state-of-the-art speech recognition research
prevents many small research groups from evaluating new ideas on
large-scale tasks. From its inception, the Institute for Signal and
Information Processing (ISIP) has been dedicated to providing the
research community with public-domain software tools for digital
information processing via the Internet to facilitate worldwide
synergistic development of speech recognition technology.
In this talk, we present the core components of our state-of-the-art
Speech-to-Text system: an acoustic processor which converts the speech
signal into a sequence of feature vectors; a training module which
estimates the parameters for a Hidden Markov Model; a linguistic
processor which predicts the next word given a sequence of previously
recognized words; and a search engine which finds the most probable
word sequence given a set of feature vectors. By far, the most
important component of a Speech-to-Text system is the search engine or
decoder. The ISIP decoder was designed to be modular and extensible so
that it can handle a wide variety of speech recognition problems
(connected digits, studio-quality read speech, and spontaneous
telephone conversations) in a transparent fashion. Moving from a
well-defined task to a less rigorously defined recognition problem
(spontaneous speech recognition, e.g., Switchboard) requires the
decoder to have a sophisticated control structure. Hence, very few
good decoders exist, and the best decoders are always considered
proprietary.
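As a rough illustration of what the linguistic processor described above
does, the following minimal sketch estimates the probability of the next
word given the previous word using a bigram model with add-one smoothing.
This is not the ISIP implementation: the function names, the "<s>" start
marker, and the smoothing scheme are assumptions chosen for brevity.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count word bigrams over tokenized sentences.

    "<s>" is a hypothetical start-of-sentence marker used as the
    history for the first word of each sentence.
    """
    counts = defaultdict(Counter)
    vocab = set()
    for words in sentences:
        prev = "<s>"
        for word in words:
            counts[prev][word] += 1
            vocab.add(word)
            prev = word
    return counts, vocab


def next_word_prob(counts, vocab, prev, word):
    """Add-one smoothed estimate of P(word | prev)."""
    total = sum(counts[prev].values())
    return (counts[prev][word] + 1) / (total + len(vocab))


def predict_next(counts, vocab, prev):
    """Return the most probable next word after `prev`."""
    # Sorting makes tie-breaking deterministic.
    return max(sorted(vocab),
               key=lambda w: next_word_prob(counts, vocab, prev, w))
```

In a full system, a score of this kind would be combined with the
acoustic score inside the search engine rather than used on its own.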
The ISIP decoder has the capability to compile network grammars,
efficiently decode n-gram language models, generate and rescore
lattices, generate N-best lists, and perform forced alignments. The
decoder is based on a hierarchical Viterbi, breadth-first search tree
that supports cross-word context-dependent acoustic models. To
maintain an efficient search space, the decoder uses lexical trees to
represent the pronunciations of all words, applies beam pruning at the
state, phone, and word levels, and limits the number of active model
instances per frame to prevent the evaluation of low-scoring
hypotheses. A benchmark evaluation (which does not include such
technology as MLLR, VTLN, or PLP) conducted on a subset of the
Switchboard corpus yielded a word error rate (WER) of 41.8%. This is
competitive with commercially available Speech-to-Text systems.
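The pruning strategy described above can be sketched on a toy HMM. The
example below is a minimal, assumption-laden illustration of a
time-synchronous (breadth-first) Viterbi search with beam pruning and a
cap on the number of active hypotheses per frame. It is not the ISIP
decoder: it works on a flat set of states, omits lexical trees and
cross-word context, and all function and parameter names (viterbi_beam,
prune, beam, max_active) are hypothetical.

```python
import math


def prune(active, beam, max_active):
    """Drop hypotheses outside the beam, then cap the number kept."""
    if not active:
        return active
    best = max(score for score, _ in active.values())
    # Beam pruning: keep hypotheses within `beam` (log units) of the best.
    kept = {s: v for s, v in active.items() if v[0] >= best - beam}
    # Limit on active hypotheses per frame (hypothetical `max_active`).
    if max_active is not None and len(kept) > max_active:
        ranked = sorted(kept.items(), key=lambda kv: kv[1][0], reverse=True)
        kept = dict(ranked[:max_active])
    return kept


def viterbi_beam(obs, states, log_init, log_trans, log_emit,
                 beam=10.0, max_active=None):
    """Return (best_state_path, best_log_score) for the observations.

    All model parameters are log probabilities stored in nested dicts.
    """
    # active maps state -> (log score, state path that reached it)
    active = {}
    for s in states:
        score = log_init.get(s, -math.inf) + log_emit[s].get(obs[0], -math.inf)
        if score > -math.inf:
            active[s] = (score, (s,))
    active = prune(active, beam, max_active)

    for o in obs[1:]:
        new_active = {}
        for prev, (score, path) in active.items():
            for nxt, lt in log_trans.get(prev, {}).items():
                cand = score + lt + log_emit[nxt].get(o, -math.inf)
                if cand == -math.inf:
                    continue
                # Viterbi recombination: keep only the best path per state.
                if nxt not in new_active or cand > new_active[nxt][0]:
                    new_active[nxt] = (cand, path + (nxt,))
        active = prune(new_active, beam, max_active)

    if not active:
        return None, -math.inf
    best_score, best_path = max(active.values(), key=lambda v: v[0])
    return list(best_path), best_score
```

A narrower beam or a smaller max_active evaluates fewer hypotheses per
frame, at the risk of discarding a path that would have scored best
overall. That speed-versus-search-error trade-off is why pruning
thresholds are usually tuned per task.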
For more information about the ISIP ASR system including all source code
and a tutorial, please visit:
http://www.cavs.msstate.edu/research/isip/projects/speech/software/asr/index.html