
We recently released the r00_n10 version of our speech recognition toolkit. This release contains many common features found in modern speech-to-text (STT) systems: a front end that converts the signal to a sequence of feature vectors, an HMM-based acoustic model trainer, and a time-synchronous hierarchical Viterbi decoder. The new features over the last release include:

  • phonetic decision tree-based state tying

  • lexical tree-based N-gram decoding

  • annotation graphs for transcriptions and hypotheses

  • a self-paced tutorial package

Some of these capabilities are described below in greater detail.

Phonetic Decision Trees

Most state-of-the-art Large Vocabulary Conversational Speech Recognition (LVCSR) systems use context-dependent Hidden Markov Models (HMMs) to model speech data. In order to model the variations in speaker characteristics and pronunciations, it is common for an LVCSR system to use several million parameters, which need to be estimated using several hours of speech training data. This explosion in the number of parameters is primarily because of the need to model acoustic units in terms of their context. However, many acoustic contexts are not observed with sufficient frequency in the training data, and therefore estimating model parameters for each context-dependent acoustic unit is difficult. Using discrete or semi-continuous density HMMs, or continuous density HMMs with tied parameters can significantly reduce the total number of model parameters to be estimated.
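
To make the scale of this explosion concrete, here is a rough back-of-the-envelope calculation; the phone inventory size, mixture counts, and feature dimension below are illustrative assumptions, not properties of our system.

    # Illustrative parameter count for untied cross-word triphone models.
    # All numbers here are assumptions chosen for the example.
    num_phones = 40                      # size of the phone inventory
    num_triphones = num_phones ** 3      # every left/right context combination
    states_per_model = 3                 # emitting states per triphone HMM
    mixtures_per_state = 16              # Gaussian components per state
    feature_dim = 39                     # e.g., MFCCs plus first and second derivatives

    # mean and diagonal covariance per component, plus one mixture weight
    params_per_state = mixtures_per_state * (2 * feature_dim + 1)
    total_params = num_triphones * states_per_model * params_per_state
    print(total_params)                  # roughly 2.4e8 before any tying

Without tying, the count is far larger than what can be reliably estimated from the available training data, which is why parameter tying is needed to bring the model down to a manageable size.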

A phonetic decision tree-based module for phonetic state tying is now integrated in the ISIP STT toolkit. This algorithm uses both the training data and phonetically derived questions to cluster the states. It is also capable of handling models with contexts that rarely occur in the training data, if at all. The implementation of this algorithm is based on the Maximum Likelihood (ML) principle, where the trees are grown until no further significant increase in likelihood can be achieved. The likelihood computation is based on state occupancy counts produced during HMM training. The current implementation uses occupancy counts based on the Viterbi estimation algorithm. The state tying module has two operating modes:

  • Training mode: During training, the decision trees are constructed in a top-down fashion by iteratively splitting the leaf nodes using phonetic questions and state occupancy counts. The terminal nodes of the tree represent the tied states.

  • Testing mode: The tree is used to generate models with unseen contexts. The result of this mode is a set of models corresponding to a user-specified list of context-dependent phones, as well as a list of clustered models.
The user interfaces for the phonetic decision tree include:
  • location of the phonetic question file:

    phonetic_ques_ans_file = "$ISIP_TUTORIAL/train/state_tying/ques_ans_TIdigits.sof";

  • splitting and merging thresholds:

    split_threshold = 100;
    merge_threshold = 100;
    num_occ_threshold = 600;
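
To make the training-mode splitting procedure concrete, here is a minimal sketch of how a leaf node could be split using phonetic questions, state occupancy counts, and the thresholds above. The dictionary-based statistics and the single-Gaussian likelihood approximation are illustrative assumptions and do not reflect the toolkit's internal data structures.

    import math

    # Each state is summarized by its Viterbi occupancy count and per-dimension
    # sufficient statistics (sums and sums of squares).  These dictionaries are
    # stand-ins for the statistics produced during HMM training.

    def cluster_log_likelihood(states):
        """Approximate log-likelihood of pooling 'states' into one tied state,
        assuming a single diagonal-covariance Gaussian fit to the pooled data."""
        occ = sum(s["occ"] for s in states)
        if occ <= 0.0:
            return 0.0
        ll = 0.0
        for d in range(len(states[0]["sum"])):
            mean = sum(s["sum"][d] for s in states) / occ
            var = max(sum(s["sum_sq"][d] for s in states) / occ - mean * mean,
                      1e-6)
            ll -= 0.5 * occ * (math.log(2.0 * math.pi * var) + 1.0)
        return ll

    def best_split(leaf_states, questions, split_threshold, num_occ_threshold):
        """Choose the phonetic question whose yes/no partition of the leaf
        gives the largest likelihood gain above the splitting threshold."""
        base = cluster_log_likelihood(leaf_states)
        best = None
        for q in questions:                  # q["contexts"] is a set of phones
            yes = [s for s in leaf_states if s["context"] in q["contexts"]]
            no = [s for s in leaf_states if s["context"] not in q["contexts"]]
            if not yes or not no:
                continue
            if sum(s["occ"] for s in yes) < num_occ_threshold or \
               sum(s["occ"] for s in no) < num_occ_threshold:
                continue
            gain = (cluster_log_likelihood(yes) + cluster_log_likelihood(no)
                    - base)
            if gain > split_threshold and (best is None or gain > best[0]):
                best = (gain, q, yes, no)
        return best   # None: no admissible split, so the leaf becomes a tied state

Splitting is repeated on the resulting leaves until no question yields a gain above the threshold; the surviving leaves then define the tied states.
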
Lexical Tree-based N-gram Decoding

Lexical tree-based N-gram decoding is implemented on top of a generalized hierarchical search space [1]. The basic idea of lexical tree expansion is to collapse the pronunciation models of different words that share the same beginning phonemes. The use of a lexical tree significantly reduces the search space and the search effort. Based on this idea of sharing a common prefix, we extend the prefix tree representation to any level in the search hierarchy.

Our implementation of lexical tree-based decoding enables users to configure the decoder to expand a lexical tree at any level. If a user enables lexical tree decoding at level i, the symbols at level i are expanded to their corresponding sub-graphs at level i+1, and the common prefixes of these symbols are shared, resulting in a tree-structured search. Lexical tree-based decoding is especially important for context-dependent phone models, where the search space grows significantly without it.
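
The prefix sharing itself can be illustrated with a small sketch; the three-word lexicon and the dictionary-based node layout below are hypothetical and serve only to show how words that share leading phonemes share tree nodes.

    # Build a lexical prefix tree from a small, hypothetical pronunciation
    # lexicon.  Words sharing leading phonemes share the same nodes, which is
    # what shrinks the search space.
    lexicon = {
        "six":   ["s", "ih", "k", "s"],
        "seven": ["s", "eh", "v", "ax", "n"],
        "zero":  ["z", "iy", "r", "ow"],
    }

    def build_lexical_tree(lexicon):
        root = {"children": {}, "words": []}
        for word, phones in lexicon.items():
            node = root
            for phone in phones:
                node = node["children"].setdefault(
                    phone, {"children": {}, "words": []})
            node["words"].append(word)   # the word identity is known at the leaf
        return root

    tree = build_lexical_tree(lexicon)
    # "six" and "seven" now share the single /s/ node under the root.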

Context-dependent phone models, which depend on the preceding and following sounds, are generally more accurate than context-independent phone models, since they can capture coarticulatory effects. In our implementation, the concept of a context-dependent model can be used at any level. If a symbol S is context-dependent, then its underlying model is determined dynamically from its neighboring symbols. This generalized implementation imposes no restrictions on the length of the context or on the number of levels that use context. Similarly, N-gram models are extended to N-symbol models, where a symbol can be a phrase, a word, or a phone, depending on the level at which it is used. The decoder does not restrict the order of the N-symbol model, so users can apply arbitrarily long time-span language or acoustic models to meet the needs of their applications.
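
As a hedged illustration of how a context-dependent symbol can be resolved dynamically from its neighbors, the sketch below maps a phone sequence to context-dependent model names. The "left-center+right" naming, the silence padding at the sequence boundaries, and the function itself are assumptions for illustration, not the decoder's internal conventions.

    def context_dependent_names(symbols, left_length=1, right_length=1):
        """Map a symbol sequence to context-dependent model names using an
        illustrative 'left-center+right' naming scheme."""
        names = []
        for i, center in enumerate(symbols):
            left = symbols[max(0, i - left_length):i]
            right = symbols[i + 1:i + 1 + right_length]
            names.append("%s-%s+%s" % ("_".join(left) or "sil",
                                       center,
                                       "_".join(right) or "sil"))
        return names

    # ["s", "ih", "k", "s"] -> ["sil-s+ih", "s-ih+k", "ih-k+s", "k-s+sil"]
    print(context_dependent_names(["s", "ih", "k", "s"]))

The left_length and right_length arguments mirror the left_context_length and right_context_length parameters shown below.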

The user interfaces for lexical tree-based N-gram decoding include:
  • lexical tree and N-gram parameters:

    @ SearchLevel 0 @
    use_lexical_tree = true;
    use_nsymbol = true;
    lm_scale = 12;
    nsymbol_order = 3;
    nsymbol_model = "$ISIP_TUTORIAL/decode/lists/tidigits_trigram.sof";

  • context-dependent model parameters:

    @ SearchLevel 1 @
    # context dependency parameters
    use_symbol_context = true;
    left_context_length = 1;
    right_context_length = 1;

Annotation Graphs

The annotation graph [2] represents the linguistic annotation of recorded speech data. The linguistic annotation, in the case of speech recognition, is simply an orthographic annotation of speech data, which may or may not be time-aligned to an audio recording. The orthographic annotation, generally referred to as a transcription, is a label associated with the audio recording. The transcription along with the audio recording is used to train the speech recognition system in a supervised learning framework. The annotated transcription may include a hierarchy of linguistic, syntactic and semantic knowledge sources that needs to be conveniently represented.

The annotation graph provides a convenient means for representing a hierarchy of knowledge sources. An annotation graph may be used to represent a single transcription or an entire conversation depending on how the speech database is organized. This alleviates the problem of having multiple copies of the same transcription for each knowledge source, and it also provides an application programmer interface (API) to tag and query the various knowledge sources. The following is an example of how to build an annotation graph that contains the orthographic transcription "the" and its corresponding phones /dh/ and /ax/:

[Figure: Annotation Graph Example]
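
A minimal sketch of such a construction, following the node-and-arc formulation of [2], is shown below; the AnnotationGraph class and its methods are illustrative assumptions, not the toolkit's actual API.

    # A sketch of an annotation graph for the word "the" and its phones /dh/
    # and /ax/, following the nodes-and-labeled-arcs formulation of [2].
    class AnnotationGraph:
        def __init__(self):
            self.arcs = []               # (start_node, end_node, level, label)

        def add_annotation(self, start, end, level, label):
            self.arcs.append((start, end, level, label))

        def annotations(self, level):
            return [(s, e, lab) for (s, e, lev, lab) in self.arcs
                    if lev == level]

    graph = AnnotationGraph()

    # Word-level annotation spanning nodes 0..2; time offsets can be attached
    # to the nodes when the transcription is aligned to the audio.
    graph.add_annotation(0, 2, "word", "the")

    # Phone-level annotations that share the word's boundary nodes.
    graph.add_annotation(0, 1, "phone", "dh")
    graph.add_annotation(1, 2, "phone", "ax")

    print(graph.annotations("word"))     # [(0, 2, 'the')]
    print(graph.annotations("phone"))    # [(0, 1, 'dh'), (1, 2, 'ax')]
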
The annotation graph representation is integrated into our speech recognition system in both training and decoding. The user interfaces for the annotation graph include:

  • In the training stage, the annotation graph is provided in the form of a transcription database. Here is an example of a parameter file that specifies the transcription:

    transcription_level = "word";
    transcription_database_file = "$ISIP_TUTORIAL/database/trans_database_word.sof";

  • In the decoding stage, the system can be set to output hypotheses in an annotation graph format. Here is an example of the settings that tell the system to output the annotation graph format. Notice that other hypothesis formats can be easily derived from the annotation graph.

    output_levels = "word";
    output_mode = "DATABASE";
    output_file = "$ISIP_TUTORIAL/decode/xword_tied/loop_grammar.db";

Tutorial Package

Included in this release is a script that guides users through all the steps required to develop a speech recognition system. The experiment in this tutorial is a continuous digit recognition task based on TIDigits. The recognition system is our standard HMM system based on context-dependent cross-word phonetic models and MFCC features.

This tutorial is self-paced. All the files required for this experiment have been bundled with this package. All files and executables in this experiment are assumed to be relative to the $ISIP_TUTORIAL environment variable. In order to run the tutorial, you will need to do the following:
  • Set the $ISIP_TUTORIAL environment variable to the path where the isip_tutorial_v00 directory is located.

  • Run the shell script located at $ISIP_TUTORIAL/scripts/tutorial.sh.
For example, if you are using the bash shell, you can set the environment variable and run the script using these commands:
    export ISIP_TUTORIAL=/home/xyz/isip_tutorial_v00
    $ISIP_TUTORIAL/scripts/tutorial.sh
The tutorial package and software for this release can be downloaded from here.

Conclusion

This release represents a substantial enhancement over our r00_n09 release. It delivers most of the functionality expected in a state-of-the-art system, and duplicates most of the functionality in our popular prototype system. We expect the interfaces included in this system to be stable and to be supported in future releases. The next release of this system, which will primarily address efficiency issues, will be r01_n00.

References

  1. B. Jelinek, F. Zheng, N. Parihar, J. Hamaker, and J. Picone, "Generalized Hierarchical Search in the ISIP ASR System," Proceedings of the Thirty-Fifth Asilomar Conference on Signals, Systems, and Computers, vol. 2, pp. 1553-1556, Pacific Grove, California, USA, November 2001.

  2. S. Bird and M. Liberman, "A Formal Framework for Linguistic Annotation," Linguistic Data Consortium, University of Pennsylvania, Philadelphia, Pennsylvania, USA, 2000.