Section 4.3.1: Scoring: Error Analysis

Evaluating or scoring the performance of speech recognition systems is critical to advances in their design and development. Several evaluation metrics can be used, depending on the complexity of the ASR system to which they are applied. Here we describe a commonly used metric, word error rate (WER). This metric evaluates the number and type of recognition errors made by the decoder.
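
WER combines the three types of recognition errors described below, substitutions (S), deletions (D), and insertions (I), normalized by the number of words (N) in the reference transcription:

    WER = (S + D + I) / N x 100%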

Scoring the result hypothesized by the decoder requires comparing it to the reference transcription, i.e., a sequence of words (or other tags) representing what was actually spoken (the correct answer). The two sequences must be aligned in order to determine the total number and type of errors made by the decoder. An example is given below:

    Decoder Hypothesis:       HAUL   MOOSE   FOR      TREES
    Reference Transcription:  CUT    TALL    SPRUCE   TREES
The scoring software supports both a time-aligned mode and a text-aligned mode. The latter has historically been the more commonly used, though recent research is increasingly shifting toward time-aligned scoring.

In time-aligned scoring, the hypothesis and reference transcriptions include start and stop times for each word. These are typically generated from a forced alignment process described in Section 4.4. In this case, errors can be easily tabulated because there is a temporal alignment of the two sequences.
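
As a rough sketch of how this tabulation might work (illustrative Python, not the actual scoring software), the fragment below pairs hypothesis and reference words whose time intervals overlap: an unpaired reference word is a deletion, an unpaired hypothesis word is an insertion, and a paired mismatch is a substitution. The Word type and the greedy overlap-pairing rule are simplifying assumptions for illustration.

    from dataclasses import dataclass

    @dataclass
    class Word:
        text: str
        start: float   # start time (seconds)
        end: float     # stop time (seconds)

    def overlaps(a: Word, b: Word) -> bool:
        # Two words are paired when their time intervals intersect.
        return a.start < b.end and b.start < a.end

    def score_time_aligned(hyp: list[Word], ref: list[Word]) -> dict:
        """Tabulate errors by pairing words with overlapping intervals."""
        counts = {"correct": 0, "substitutions": 0,
                  "deletions": 0, "insertions": 0}
        paired = set()   # indices of hypothesis words already used
        for r in ref:
            match = next((i for i, h in enumerate(hyp)
                          if i not in paired and overlaps(h, r)), None)
            if match is None:
                counts["deletions"] += 1        # spoken word, no hypothesis
            else:
                paired.add(match)
                if hyp[match].text == r.text:
                    counts["correct"] += 1      # texts agree
                else:
                    counts["substitutions"] += 1
        counts["insertions"] = len(hyp) - len(paired)  # hypothesis-only words
        return counts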

However, historically, a text alignment algorithm, also known as a string edit algorithm, has been used to compare the two text sequences. The output of this alignment algorithm, which simply tries to minimize the number of edits required to map the hypothesis onto the reference, is shown below:

                              T0    T1     T2      T3    T4
    Time-Aligned Hypothesis:  ***   HAUL   MOOSE   FOR   TREES
    Time-Aligned Reference:   CUT   TALL   SPRUCE  ***   TREES

In this example, the decoder recognized one word correctly ("TREES"). Note that the above alignment might differ from what the recognizer actually produced. However, it is convenient to score this way, since it decouples the scoring software from the recognizer output. (We understand this sounds silly, but it is one of those "historical accidents" in speech research.)
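
To make the string edit step concrete, here is a minimal dynamic-programming sketch of such an alignment (a standard Levenshtein alignment in illustrative Python; actual scoring tools such as NIST's typically add tuned edit costs and other refinements). Note that on this example, a pure minimum-edit alignment prefers three substitutions over the deletion/insertion pattern shown above, since it knows nothing about timing; this is one way the text-aligned and time-aligned views can disagree.

    def align(hyp, ref):
        """Minimum-edit (Levenshtein) alignment of hypothesis to reference.
        Returns one (label, hyp_word, ref_word) triple per slot."""
        H, R = len(hyp), len(ref)
        # d[i][j] = minimum edits to align hyp[:i] with ref[:j]
        d = [[0] * (R + 1) for _ in range(H + 1)]
        for i in range(1, H + 1):
            d[i][0] = i                         # all insertions
        for j in range(1, R + 1):
            d[0][j] = j                         # all deletions
        for i in range(1, H + 1):
            for j in range(1, R + 1):
                d[i][j] = min(d[i-1][j-1] + (hyp[i-1] != ref[j-1]),  # sub/correct
                              d[i-1][j] + 1,                         # insertion
                              d[i][j-1] + 1)                         # deletion
        # Backtrace from the far corner to recover the alignment.
        out, i, j = [], H, R
        while i > 0 or j > 0:
            if (i > 0 and j > 0 and
                    d[i][j] == d[i-1][j-1] + (hyp[i-1] != ref[j-1])):
                label = "correct" if hyp[i-1] == ref[j-1] else "substitution"
                out.append((label, hyp[i-1], ref[j-1]))
                i, j = i - 1, j - 1
            elif i > 0 and d[i][j] == d[i-1][j] + 1:
                out.append(("insertion", hyp[i-1], "***"))   # hypothesis-only
                i -= 1
            else:
                out.append(("deletion", "***", ref[j-1]))    # reference-only
                j -= 1
        return out[::-1]

    # On this section's example, pure minimum-edit alignment finds
    # three substitutions and one correct word (four slots, no gaps):
    for label, h, r in align("HAUL MOOSE FOR TREES".split(),
                             "CUT TALL SPRUCE TREES".split()):
        print(f"{label:12s}  {h:8s} {r}")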

To understand how errors are counted and categorized, we must examine each time period, Ti. At time T0, the word "cut" was spoken, but the decoder hypothesized no word. This is considered a deletion error because the decoder removed or deleted a word that was actually spoken. At time T1, the word "tall" was spoken, but the decoder hypothesized "haul". This is considered a substitution error because the decoder substituted the word "haul" for "tall". A similar error was made at time T2. At time T3, the decoder hypothesized the word "for" when nothing was actually spoken. This is considered an insertion error because the decoder inserted a word into silence, where no word was actually spoken.

In summary, the decoder made two substitution errors, one insertion error, one deletion error, and hypothesized one word correctly out of a total of four words actually spoken. For further general discussion of evaluation and scoring, see evaluation metrics from our on-line speech recognition course notes. To learn more about how to score using WER, continue to NIST scoring in the next section.
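
Plugging these counts into the WER formula given at the top of this section:

    WER = (S + D + I) / N = (2 + 1 + 1) / 4 = 100%

Note that because insertions are counted in the numerator but not in N, WER can exceed 100%.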