4.1.3 Overview: Recognition Modes
In the process of producing the best word sequence that represents
an input utterance, continuous speech recognition systems do many other
things that provide very useful intermediate information about the data.
For example, all intermediate symbols used to decode the utterance are
located and time-aligned with the speech data. An example of this is shown
to the right. This time alignment is very useful in an application such
as audio indexing in which users want to browse large archives of
audio information ("find the speech that President Bush gave to the U.N.
last night"). The time alignment described above is one of many modes
in which the recognition program can operate.
Our recognizer supports two basic search algorithms: a Viterbi beam
search and a stack search.
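To make the first of these concrete, here is a minimal sketch of a Viterbi beam search in Python. It is purely illustrative and is not the recognizer's actual implementation: the trellis representation (per-frame observation log-scores and transition log-scores keyed by state pairs) and the function name are invented for this example. At each frame the search keeps the best-scoring path into every reachable state, then prunes any hypothesis whose log score falls more than a beam width below the current best.

```python
def viterbi_beam_search(obs_scores, trans_scores, beam_width):
    # obs_scores[t][s]: log-score of the observation at frame t in state s
    # trans_scores[(p, s)]: log-score of moving from state p to state s
    # Each frontier entry maps a state to (path log-score, backpointer).
    frontier = {s: (score, None) for s, score in obs_scores[0].items()}
    history = [frontier]
    for t in range(1, len(obs_scores)):
        new_frontier = {}
        for state, obs in obs_scores[t].items():
            candidates = [
                (prev_score + trans_scores[(prev, state)] + obs, prev)
                for prev, (prev_score, _) in frontier.items()
                if (prev, state) in trans_scores
            ]
            if candidates:
                new_frontier[state] = max(candidates)
        # Beam pruning: discard states too far below the frame's best score.
        best = max(score for score, _ in new_frontier.values())
        frontier = {s: v for s, v in new_frontier.items()
                    if v[0] >= best - beam_width}
        history.append(frontier)
    # Backtrace from the best final state to recover the state sequence.
    state = max(frontier, key=lambda s: frontier[s][0])
    final_score = frontier[state][0]
    path = [state]
    for t in range(len(obs_scores) - 1, 0, -1):
        state = history[t][state][1]
        path.append(state)
    path.reverse()
    return path, final_score
```

A wider beam prunes less and costs more computation; a very narrow beam can prune away the path that would have been globally best, which is the usual accuracy/speed trade-off in beam search.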
Within these search algorithms, there are many modes that control
what type of information about the search process is output.
The modes relevant to the
recognition process described in this section are:
- Network Decoding:
The recognizer can be guided by a state machine that describes
which sequences of symbols (e.g., words) are permissible.
The recognizer searches these networks for the overall best
(e.g., "most probable") symbol sequence. At the very least,
the recognizer can output the corresponding symbols; often, it
outputs a more detailed analysis of each utterance.
- Forced Alignment:
The recognizer accepts an input transcription for each utterance,
and aligns this transcription with the data. The "optimal"
start and stop times of each symbol in the transcription are
determined automatically.
- N-Best Generation:
The recognizer can output a list of the N most probable word
sequences. This output format is known as an N-best list, and can be
generated using a modified version of the Viterbi search algorithm.
N-best lists are popular because they can be postprocessed by many
natural language processing tools, and can be reordered based on
their grammatical and semantic content. Again, such an intermediate
format is simply an efficiency compromise since we often can't do a
full search with a large number of natural language constraints.
- Word Graph Generation:
A recognizer can output many alternative explanations of the data
in addition to the "best" explanation. In fact, state-of-the-art
systems often generate a large file containing many possibilities,
and then rescore these using a more sophisticated system.
Such approaches are known as "multi-pass" systems and are convenient
ways to implement search processes too complex to do in a single
pass. The output format for this intermediate data is often referred
to as a lattice or word graph.
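The forced alignment mode above can be sketched as a small dynamic program: given a fixed transcription, find the segmentation of the frames that maximizes the total score, which yields the start and stop frame of each symbol. This Python sketch is illustrative only; the per-frame symbol log-scores and the function name are assumptions, not the recognizer's interface.

```python
def force_align(transcript, frame_scores):
    # transcript: the known symbol sequence for this utterance, in order
    # frame_scores[t][sym]: log-score of symbol sym explaining frame t
    n_frames, n_syms = len(frame_scores), len(transcript)
    NEG = float("-inf")
    # dp[t][i]: best log-score of explaining frames 0..t with the
    # transcript up to and including symbol i
    dp = [[NEG] * n_syms for _ in range(n_frames)]
    back = [[None] * n_syms for _ in range(n_frames)]
    dp[0][0] = frame_scores[0][transcript[0]]
    for t in range(1, n_frames):
        for i in range(n_syms):
            stay = dp[t - 1][i]                      # remain in symbol i
            enter = dp[t - 1][i - 1] if i > 0 else NEG  # advance from i-1
            dp[t][i] = max(stay, enter) + frame_scores[t][transcript[i]]
            back[t][i] = i if stay >= enter else i - 1
    # Backtrace: which transcript position does each frame belong to?
    i, labels = n_syms - 1, []
    for t in range(n_frames - 1, 0, -1):
        labels.append(i)
        i = back[t][i]
    labels.append(i)
    labels.reverse()
    # Collapse per-frame labels into (symbol, start_frame, stop_frame).
    segments = []
    for t, i in enumerate(labels):
        if segments and labels[t - 1] == i:
            sym, start, _ = segments[-1]
            segments[-1] = (sym, start, t)
        else:
            segments.append((transcript[i], t, t))
    return segments, dp[-1][-1]
```

In a real recognizer the frame scores come from the acoustic models, and the alignment is typically done at the state level rather than the symbol level, but the recursion is the same.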
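The N-best and word graph modes are closely related: an N-best list can be read off a word graph by enumerating its highest-scoring paths, and multi-pass rescoring amounts to adding a second score to each word as the paths are expanded. The following Python sketch illustrates this under invented assumptions: the lattice is a simple dictionary of `(word, log_score, next_node)` edges, and the optional `rescore` callback stands in for a second-pass model such as a language model.

```python
import heapq

def nbest_from_lattice(lattice, start, end, n, rescore=None):
    # lattice: {node: [(word, log_score, next_node), ...]}, scores <= 0
    # rescore(history, word): optional second-pass log-score adjustment
    # Best-first expansion: the heap orders partial paths by negated score.
    heap = [(0.0, [], start)]
    results = []
    while heap and len(results) < n:
        neg_score, words, node = heapq.heappop(heap)
        if node == end:
            # With non-positive scores, complete paths pop in best-first
            # order, so the first n to reach the end node are the n best.
            results.append((-neg_score, words))
            continue
        for word, score, nxt in lattice.get(node, []):
            total = -neg_score + score
            if rescore:
                total += rescore(words, word)
            heapq.heappush(heap, (-total, words + [word], nxt))
    return results
```

Rescoring during expansion is what makes the "multi-pass" strategy attractive: the expensive model only ever scores the relatively few hypotheses the first pass left in the graph, rather than the full search space.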
As with most features in the recognizer, these modes are selected
through a parameter in the recognition configuration file.
For example, acoustic training, discussed in Section 5, is simply
implemented as another mode in the recognizer.