4.1.3 Overview: Recognition Modes
In the process of producing the best word sequence that represents
an input utterance, continuous speech recognition systems do many other
things that provide very useful intermediate information about the data.
For example, all intermediate symbols used to decode the utterance are
located and time-aligned with the speech data. An example of this is shown
to the right. This time alignment is very useful in an application such
as audio indexing in which users want to browse large archives of
audio information ("find the speech that President Bush gave to the U.N.
last night"). The time alignment described above is one of many modes
in which the recognition program can operate.
Our recognizer supports two basic search algorithms: a Viterbi beam
search and a stack search.
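To make the first of these concrete, here is a minimal sketch of a Viterbi beam search in Python. It is purely illustrative and is not the recognizer's actual implementation: the trellis representation (per-frame observation log-scores and transition log-scores keyed by state pairs) and the function name are invented for this example. At each frame the search keeps the best-scoring path into every reachable state, then prunes any hypothesis whose log score falls more than a beam width below the current best.

```python
def viterbi_beam_search(obs_scores, trans_scores, beam_width):
    # obs_scores[t][s]: log-score of the observation at frame t in state s
    # trans_scores[(p, s)]: log-score of moving from state p to state s
    # Each frontier entry maps a state to (path log-score, backpointer).
    frontier = {s: (score, None) for s, score in obs_scores[0].items()}
    history = [frontier]
    for t in range(1, len(obs_scores)):
        new_frontier = {}
        for state, obs in obs_scores[t].items():
            candidates = [
                (prev_score + trans_scores[(prev, state)] + obs, prev)
                for prev, (prev_score, _) in frontier.items()
                if (prev, state) in trans_scores
            ]
            if candidates:
                new_frontier[state] = max(candidates)
        # Beam pruning: discard states too far below the frame's best score.
        best = max(score for score, _ in new_frontier.values())
        frontier = {s: v for s, v in new_frontier.items()
                    if v[0] >= best - beam_width}
        history.append(frontier)
    # Backtrace from the best final state to recover the state sequence.
    state = max(frontier, key=lambda s: frontier[s][0])
    final_score = frontier[state][0]
    path = [state]
    for t in range(len(obs_scores) - 1, 0, -1):
        state = history[t][state][1]
        path.append(state)
    path.reverse()
    return path, final_score
```

A wider beam prunes less and costs more computation; a very narrow beam can prune away the path that would have been globally best, which is the usual accuracy/speed trade-off in beam search.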
Within these search algorithms, there are many modes that control
what type of information about the search process is output.
The modes relevant to the
recognition process described in this section are:
- Network Decoding:
The recognizer can be guided by a state machine that describes
which sequences of symbols (e.g., words) are permissible.
The recognizer searches these networks for the overall best
(e.g., "most probable") symbol sequence. At the very least,
the recognizer can output the corresponding symbols; often, it
outputs a more detailed analysis of each utterance.
- Forced Alignment:
The recognizer accepts an input transcription for each utterance,
and aligns this transcription with the data. The "optimal"
start and stop times of each symbol in the transcription are
determined automatically.
- N-Best Generation:
The recognizer can output a list of the N most probable word
sequences. This output format is known as an N-best list, and can be
generated using a modified version of the Viterbi search algorithm.
N-best lists are popular because they can be postprocessed by many
natural language processing tools, and can be reordered based on
their grammatical and semantic content. Again, such an intermediate
format is simply an efficiency compromise since we often can't do a
full search with a large number of natural language constraints.
- Word Graph Generation:
A recognizer can output many alternative explanations of the data
in addition to the "best" explanation. In fact, state-of-the-art
systems often generate a large file containing many possibilities,
and then rescore these using a more sophisticated system.
Such approaches are known as "multi-pass" systems and are convenient
ways to implement search processes too complex to do in a single
pass. The output format for this intermediate data is often referred
to as a lattice or word graph.
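The forced alignment mode above can be sketched as a small dynamic program: given a fixed transcription, find the segmentation of the frames that maximizes the total score, which yields the start and stop frame of each symbol. This Python sketch is illustrative only; the per-frame symbol log-scores and the function name are assumptions, not the recognizer's interface.

```python
def force_align(transcript, frame_scores):
    # transcript: the known symbol sequence for this utterance, in order
    # frame_scores[t][sym]: log-score of symbol sym explaining frame t
    n_frames, n_syms = len(frame_scores), len(transcript)
    NEG = float("-inf")
    # dp[t][i]: best log-score of explaining frames 0..t with the
    # transcript up to and including symbol i
    dp = [[NEG] * n_syms for _ in range(n_frames)]
    back = [[None] * n_syms for _ in range(n_frames)]
    dp[0][0] = frame_scores[0][transcript[0]]
    for t in range(1, n_frames):
        for i in range(n_syms):
            stay = dp[t - 1][i]                      # remain in symbol i
            enter = dp[t - 1][i - 1] if i > 0 else NEG  # advance from i-1
            dp[t][i] = max(stay, enter) + frame_scores[t][transcript[i]]
            back[t][i] = i if stay >= enter else i - 1
    # Backtrace: which transcript position does each frame belong to?
    i, labels = n_syms - 1, []
    for t in range(n_frames - 1, 0, -1):
        labels.append(i)
        i = back[t][i]
    labels.append(i)
    labels.reverse()
    # Collapse per-frame labels into (symbol, start_frame, stop_frame).
    segments = []
    for t, i in enumerate(labels):
        if segments and labels[t - 1] == i:
            sym, start, _ = segments[-1]
            segments[-1] = (sym, start, t)
        else:
            segments.append((transcript[i], t, t))
    return segments, dp[-1][-1]
```

In a real recognizer the frame scores come from the acoustic models, and the alignment is typically done at the state level rather than the symbol level, but the recursion is the same.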
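The N-best and word graph modes are closely related: an N-best list can be read off a word graph by enumerating its highest-scoring paths, and multi-pass rescoring amounts to adding a second score to each word as the paths are expanded. The following Python sketch illustrates this under invented assumptions: the lattice is a simple dictionary of `(word, log_score, next_node)` edges, and the optional `rescore` callback stands in for a second-pass model such as a language model.

```python
import heapq

def nbest_from_lattice(lattice, start, end, n, rescore=None):
    # lattice: {node: [(word, log_score, next_node), ...]}, scores <= 0
    # rescore(history, word): optional second-pass log-score adjustment
    # Best-first expansion: the heap orders partial paths by negated score.
    heap = [(0.0, [], start)]
    results = []
    while heap and len(results) < n:
        neg_score, words, node = heapq.heappop(heap)
        if node == end:
            # With non-positive scores, complete paths pop in best-first
            # order, so the first n to reach the end node are the n best.
            results.append((-neg_score, words))
            continue
        for word, score, nxt in lattice.get(node, []):
            total = -neg_score + score
            if rescore:
                total += rescore(words, word)
            heapq.heappush(heap, (-total, words + [word], nxt))
    return results
```

Rescoring during expansion is what makes the "multi-pass" strategy attractive: the expensive model only ever scores the relatively few hypotheses the first pass left in the graph, rather than the full search space.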
As with most features in the recognizer, these modes are selected
through a parameter in the recognition configuration file.
For example, acoustic training, discussed in Section 5, is simply
implemented as another mode in the recognizer.