Section 4.2.2: Recognition Using Word Models

In this section, we will focus on speech recognition using word models. Word models are one of many different types of acoustic models that can be used in our recognition system. We will use the recognizer to decode a list of test utterances and will briefly explain the recognition process.

Let's start by decoding a list of test utterances. We'll use utterances from the TIDIGITS subset introduced in Section 2. The features for this subset have already been extracted. Go to the directory $ISIP_TUTORIAL/sections/s04/s04_02_p02/.

    cd $ISIP_TUTORIAL/sections/s04/s04_02_p02/
and run the following command:

    isip_recognize -parameter_file params_decode.sof -list $ISIP_TUTORIAL/databases/lists/identifiers_test.sof -verbose all
Expected Output:
    Command: isip_recognize -parameter_file params_decode.sof -list /ftp/pub/projects/speech/software/tutorials/production/fundamentals/current/examples/databases/lists/identifiers_test.sof -verbose all
    Version: 1.23 (not released) 2003/05/21 23:10:45
    loading audio database: $ISIP_TUTORIAL/databases/db/tidigits_audio_db_test.sof
    *** no symbol graph database file was specified ***
    *** no transcription database file was specified ***
    loading front-end: $ISIP_TUTORIAL/recipes/frontend.sof
    loading language model: $ISIP_TUTORIAL/models/word_models/compare/lm_word_jsgf_8mix.sof
    loading statistical model pool: $ISIP_TUTORIAL/models/word_models/compare/smp_word_8mix.sof
    *** no configuration file was specified ***
    opening the output file: $ISIP_TUTORIAL/sections/s04/s04_02_p02/results.out
    processing file 1 (ah_111a): $ISIP_TUTORIAL/databases/sof_8k/test/ah_111a.sof
       hyp: ONE ONE ONE
       score: -8946.990234375
       frames: 138
    processing file 2 (ah_1a): $ISIP_TUTORIAL/databases/sof_8k/test/ah_1a.sof
       hyp: ONE
       score: -5084.52880859375
       frames: 79
    ...
The console output provides brief diagnostic information about the results, including the hypothesis for each utterance and the (log) likelihood that the hypothesis is correct. (Technically, this is simply a score on a log scale that reflects the similarity between the utterance and the best sequence of models that could have produced it.)
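
To see why these scores are large negative numbers, note that the recognizer accumulates one likelihood per feature frame. The following sketch (ordinary Python, not part of the ISIP toolkit, with made-up per-frame values) shows why the accumulation is done in the log domain:

    import math

    # Hypothetical per-frame likelihoods p(frame | model); in the real
    # system these come from the statistical model pool.
    frame_likelihoods = [1e-28, 5e-29, 2e-28] * 46   # about 138 frames

    # Multiplying raw probabilities underflows to zero for long utterances.
    product = 1.0
    for p in frame_likelihoods:
        product *= p
    print(product)       # 0.0 -- numerically useless

    # Summing log-likelihoods stays representable, which is why scores
    # such as -8946.99 above are reported on a log scale.
    log_score = sum(math.log(p) for p in frame_likelihoods)
    print(log_score)     # a large negative number, comparable across hypotheses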

Now, let's briefly examine the components needed to complete the recognition process. There are two console inputs: a parameter file and a list of audio utterance identifiers. The first, the parameter file, is explained in detail in Section 4.2.6. The second, described in detail in Section 2.4.2, simply lists the utterances to be processed by their identifiers.

The parameter file's main purpose is to provide a reference to the three main components of the recognizer: a front end, an acoustic model library, and a hierarchy of language models. You can view the parameter file, params_decode.sof, in your browser. It contains the following information:
    @ Sof v1.0 @
    @ HiddenMarkovModel 0 @
    
    algorithm = "DECODE";
    implementation = "VITERBI";
    output_mode = "DATABASE";
    output_type = "TEXT";
    output_file = "$ISIP_TUTORIAL/sections/s04/s04_02_p02/results.out";
    frontend = "$ISIP_TUTORIAL/recipes/frontend.sof";
    audio_database = "$ISIP_TUTORIAL/databases/db/tidigits_audio_db_test.sof";
    language_model = "$ISIP_TUTORIAL/models/word_models/compare/lm_word_jsgf_8mix.sof";
    statistical_model_pool = "$ISIP_TUTORIAL/models/word_models/compare/smp_word_8mix.sof";
    
This is a text Sof file that names the essential files needed to configure and run the recognizer. The algorithm and implementation parameters specify the recognition mode (e.g., DECODE) and the type of search algorithm to be used (e.g., VITERBI). The parameters output_file and output_type direct the recognizer to store the results in text format in the file results.out. The default output format is binary, which is better suited to large-scale experiments; we use text mode here so we can easily view the results.

The parameter frontend specifies the front end used to convert audio data to features. This process is discussed extensively in Section 3. The recognizer needs this input file so that it can check whether the front end used to generate the acoustic models is compatible with the front end used to generate the features.
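
The check itself is internal to the recognizer, but the idea can be sketched in a few lines of Python (the field names below are hypothetical, not the actual contents of frontend.sof):

    # Hypothetical front-end compatibility check: models trained on one
    # feature definition cannot meaningfully score features computed
    # with a different one.
    def compatible(train_cfg: dict, decode_cfg: dict) -> bool:
        keys = ("sample_rate", "frame_duration", "num_cepstral_coeffs")
        return all(train_cfg.get(k) == decode_cfg.get(k) for k in keys)

    train  = {"sample_rate": 8000, "frame_duration": 0.01, "num_cepstral_coeffs": 12}
    decode = {"sample_rate": 8000, "frame_duration": 0.01, "num_cepstral_coeffs": 12}
    assert compatible(train, decode)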

The parameter audio_database specifies the audio database used to map the identifiers in the input list, identifiers_test.sof, to the correct audio data. Each identifier corresponds to a record in the audio database that provides an audio data file name (in this case a feature file). There is a corresponding entry in the transcription database, not used here, that can contain start and stop times defining the portion of the audio file to be processed. This is described in more detail in Section 2.4.2.
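
Conceptually, the audio database behaves like a lookup table from identifier to audio record. A hypothetical Python view of the mapping (the real database is a Sof file, not a dictionary):

    # Hypothetical identifier -> record mapping mirroring the console
    # output above; a transcription database record could additionally
    # carry start/stop times, but none is used in this example.
    audio_db = {
        "ah_111a": {"file": "$ISIP_TUTORIAL/databases/sof_8k/test/ah_111a.sof"},
        "ah_1a":   {"file": "$ISIP_TUTORIAL/databases/sof_8k/test/ah_1a.sof"},
    }

    for ident in ["ah_111a", "ah_1a"]:   # contents of identifiers_test.sof
        print("processing", ident, ":", audio_db[ident]["file"])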

The language model file specifies a hierarchy of language models that includes a word-level grammar, which controls what sequences of words are allowed, and a mapping of words to acoustic models. This component of the system actually merges acoustic and language modeling into a hierarchy of finite state machines. Acoustic modeling is described in more detail in Section 5; language modeling is described in more detail in Section 6. The final parameter, statistical_model_pool, describes a set of statistical models, typically Gaussian mixture models, which represent the terminal nodes in the hierarchy of language models and allow feature vectors to be converted to likelihoods.
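
To make that last step concrete, here is a sketch, with invented parameters rather than the contents of smp_word_8mix.sof, of how a diagonal-covariance Gaussian mixture model converts a feature vector into a log-likelihood:

    import math

    def gmm_log_likelihood(x, weights, means, variances):
        """Log-likelihood of feature vector x under a diagonal-covariance GMM."""
        log_terms = []
        for w, mu, var in zip(weights, means, variances):
            # log of w * N(x; mu, diag(var))
            ll = math.log(w)
            for xi, mi, vi in zip(x, mu, var):
                ll += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
            log_terms.append(ll)
        m = max(log_terms)   # log-sum-exp for numerical stability
        return m + math.log(sum(math.exp(t - m) for t in log_terms))

    # Toy 2-mixture model over 2-dimensional features; the models used in
    # this tutorial have 8 mixture components over full feature vectors.
    x = [0.3, -1.1]
    print(gmm_log_likelihood(x, [0.6, 0.4],
                             [[0.0, -1.0], [1.0, 0.0]],
                             [[1.0, 1.0], [2.0, 2.0]]))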

The acoustic modeling component of a speech recognition system models the individual sounds in a speech signal. In the configuration demonstrated above, our recognition system is based on hidden Markov models (HMMs), which include a temporal component that captures how a sound varies in time. A typical HMM has two components: an underlying statistical model at each state and transition probabilities that model the temporal dimension (variation in time). The underlying statistical models are contained in the statistical model pool, and the topology and transition probabilities are part of the language model file.
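
A minimal Viterbi sketch over a hypothetical three-state, left-to-right word HMM illustrates how the two components combine; all transition probabilities and output scores below are invented for illustration:

    import math

    log = math.log
    # trans[i] maps next-state -> log transition probability from state i.
    # Self-loops let a state absorb several frames, modeling variation in time.
    trans = [
        {0: log(0.6), 1: log(0.4)},
        {1: log(0.5), 2: log(0.5)},
        {2: log(1.0)},
    ]
    # Made-up per-frame output log-likelihoods, obs[t][state]; a real system
    # gets these from each state's Gaussian mixture in the model pool.
    obs = [
        [-2.0, -5.0, -9.0],
        [-3.0, -2.5, -8.0],
        [-6.0, -2.0, -3.0],
        [-9.0, -4.0, -1.5],
    ]

    # Viterbi: best[t][j] = score of the best state path ending in j at t.
    best = [{0: obs[0][0]}]          # paths must start in state 0
    for t in range(1, len(obs)):
        col = {}
        for i, score in best[-1].items():
            for j, lp in trans[i].items():
                cand = score + lp + obs[t][j]
                if j not in col or cand > col[j]:
                    col[j] = cand
        best.append(col)
    print(max(best[-1].values()))    # best path score, a log value like the hyp scores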

For a system that uses word acoustic models, each word is modeled by an HMM. Word models are popular in small-vocabulary tasks such as TIDIGITS, where the number of words is small. For TIDIGITS, there are 11 word models (ONE through NINE, plus ZERO and OH), one for each word in the vocabulary. In addition to these 11 word models, there is a model, called silence, that represents the non-speech portions of the utterance. The process of modeling non-speech acoustics is known as silence modeling; it is a subject of active research and is particularly challenging when detecting the start and end of utterances in real-time systems.
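
The model inventory for this task is small enough to enumerate. A hypothetical Python sketch of it (the state counts are invented for illustration; the real topologies live in the model files):

    # Hypothetical inventory for the TIDIGITS word-model system: eleven
    # word HMMs plus one silence model.
    vocabulary = ["ONE", "TWO", "THREE", "FOUR", "FIVE", "SIX",
                  "SEVEN", "EIGHT", "NINE", "ZERO", "OH"]
    num_states = {word: 10 for word in vocabulary}
    num_states["SEVEN"] = 14      # longer words often get more states
    num_states["silence"] = 3     # a short model for non-speech regions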

In addition to the acoustic information, linguistic information is extremely important in recognizing natural speech. The language model component of the speech recognizer describes the grammar: a set of rules defining the permissible structures of a language. The language model is represented in either a graph or text format. Language models can be broadly classified into two types: stochastic and non-stochastic. Good examples of stochastic language models are the N-gram models that are widely used in state-of-the-art recognizers. These language models assign probabilities to particular sequences of words, and those probabilities are taken into account when determining the sequence of words that was actually spoken.
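
For example, a bigram model scores a word sequence by chaining the conditional probability of each word given its predecessor. A toy sketch with invented probabilities, not taken from any model shipped with the tutorial:

    import math

    # Toy bigram probabilities P(next | previous); <s> and </s> mark the
    # start and end of an utterance.
    bigram = {
        ("<s>", "ONE"): 0.2, ("ONE", "ONE"): 0.1,
        ("ONE", "</s>"): 0.3, ("ONE", "TWO"): 0.15,
    }

    def sentence_log_prob(words):
        path = ["<s>"] + words + ["</s>"]
        return sum(math.log(bigram[(a, b)]) for a, b in zip(path, path[1:]))

    # A recognizer combines scores like this with the acoustic score when
    # ranking hypotheses such as "ONE ONE ONE".
    print(sentence_log_prob(["ONE", "ONE"]))   # log(0.2) + log(0.1) + log(0.3)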

A loop grammar, popular in digit recognition tasks, is an example of a non-stochastic language model. This type of language model defines all permissible word sequences as paths through a graph; in our system, this network is described in the language model file.
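
The digit loop is easy to express in grammar notation. The file lm_word_jsgf_8mix.sof is a compiled Sof model, but a JSGF-style sketch of the underlying loop (hypothetical, and omitting the silence model) might read:

    #JSGF V1.0;
    grammar digits;

    // Any non-empty digit sequence is permissible; every word may
    // follow every other word, forming a loop through the graph.
    public <utterance> = <digit>+;
    <digit> = ONE | TWO | THREE | FOUR | FIVE | SIX | SEVEN | EIGHT |
              NINE | ZERO | OH;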

The configuration file is also an important part of the specification of the recognition system. Further details of this file, along with editing and tuning instructions, are given in Section 4.2.7. Any parameter that can be specified in the configuration file can be overridden from the command line. Further, if no configuration file is specified, the recognizer defaults to a widely used set of values for its key parameters; since the example parameter file shown above specifies no configuration file, those defaults were used to produce the results shown here.

Once the recognizer has produced its results, a scoring report can be generated. Scoring is the process of comparing the results from the recognizer to a true transcription of the utterances. The scoring report contains a wealth of useful information and statistics. Scoring is explained in detail in Section 4.3.
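
At the heart of scoring is an alignment of each hypothesis against its reference; the statistic reported as word error rate is built on word-level edit distance, sketched here in plain Python (the hypothesis below is invented for illustration):

    def edit_distance(ref, hyp):
        """Minimum substitutions + insertions + deletions turning ref into hyp."""
        d = list(range(len(hyp) + 1))      # row for the empty reference
        for i, r in enumerate(ref, 1):
            prev, d[0] = d[0], i           # prev holds the diagonal cell
            for j, h in enumerate(hyp, 1):
                cur = min(d[j] + 1,        # deletion
                          d[j - 1] + 1,    # insertion
                          prev + (r != h)) # substitution (or match)
                prev, d[j] = d[j], cur
        return d[-1]

    ref = "ONE ONE ONE".split()                # true transcription
    hyp = "ONE OH ONE".split()                 # hypothetical recognizer output
    print(edit_distance(ref, hyp) / len(ref))  # word error rate: 1/3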
   