4.3.1 Scoring: Error Analysis
Evaluating or scoring the performance of speech recognition systems is critical to advances in their design and development. Several evaluation metrics can be used, depending on the complexity of the ASR system to which they are applied. Here we describe a commonly used metric, word error rate (WER). This metric evaluates the number and type of recognition errors made by the decoder. Scoring the result hypothesized by the decoder requires comparing it to the reference transcription, i.e., a text sequence of words (or other tags) representing what was actually spoken (e.g., the answer). The comparison must be aligned in order to determine the total number and type of errors made by the decoder. An example is given below:
In time-aligned scoring, the hypothesis and reference transcriptions include start and stop times for each word. These are typically generated from a forced alignment process described in Section 4.4. In this case, errors can be easily tabulated because there is a temporal alignment of the two sequences. However, historically, a text alignment algorithm, also known as a string edit algorithm, has been used to compare the two text sequences. The output of this alignment algorithm, which simply tries to minimize the number of edits required to map the hypothesis onto the reference, is shown below:
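The string edit algorithm mentioned above can be sketched as a standard dynamic-programming (Levenshtein) alignment between the reference and hypothesis word sequences. This is a minimal illustration, not the code of any particular scoring tool; the function name and the operation labels ("ok", "sub", "ins", "del") are ours.

```python
def align(ref, hyp):
    """Return a minimum-edit alignment of hypothesis onto reference
    as a list of (op, ref_word, hyp_word) tuples."""
    m, n = len(ref), len(hyp)
    # cost[i][j] = minimum number of edits to map hyp[:j] onto ref[:i]
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0] = i          # mapping hyp prefix of length 0: all deletions
    for j in range(1, n + 1):
        cost[0][j] = j          # mapping onto empty reference: all insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = ref[i - 1] == hyp[j - 1]
            cost[i][j] = min(
                cost[i - 1][j - 1] + (0 if same else 1),  # match or substitution
                cost[i - 1][j] + 1,                       # deletion
                cost[i][j - 1] + 1,                       # insertion
            )
    # Trace back from the bottom-right corner to recover the operations.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1]
                + (0 if ref[i - 1] == hyp[j - 1] else 1)):
            same = ref[i - 1] == hyp[j - 1]
            ops.append(("ok" if same else "sub", ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    return list(reversed(ops))

# e.g. align("the cat sat".split(), "the hat sat on".split())
#   -> [("ok", "the", "the"), ("sub", "cat", "hat"),
#       ("ok", "sat", "sat"), ("ins", None, "on")]
```

Note that when several alignments achieve the same minimum edit cost, the traceback picks one arbitrarily, which is exactly why the alignment used for scoring need not match what the recognizer internally produced.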
In this example, the decoder recognized one word correctly ("TREES"). Note that the above alignment might be different from what the recognizer actually produced. However, it is convenient to score this way because it decouples the scoring software from the recognizer output. (We understand this sounds silly, but this is one of those "historical accidents" in speech research.) To understand how errors are counted and categorized, we must examine each time period, Ti. At time T0, the word "cut" was spoken, but the decoder hypothesized no word. This is a deletion error because the decoder deleted a word that was actually spoken. At time T1, the word "tall" was spoken, but the decoder hypothesized "haul". This is a substitution error because the decoder substituted the word "haul" for "tall". A similar error was made at time T2. At time T3, the decoder hypothesized the word "for" when nothing was actually spoken. This is an insertion error because the decoder inserted a word into silence, where no word was spoken. In summary, the decoder made two substitution errors, one insertion error, and one deletion error, and hypothesized one word correctly out of the four words actually spoken. For further general discussion of evaluation and scoring, see evaluation metrics from our on-line speech recognition course notes. To learn more about how to score using WER, continue to NIST scoring in the next section.
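Once the substitution (S), deletion (D), and insertion (I) counts are tabulated, WER is conventionally computed as (S + D + I) / N, where N is the number of words in the reference transcription. A minimal sketch (the function name is ours, not from any scoring package):

```python
def word_error_rate(subs, dels, ins, n_ref):
    """WER = (S + D + I) / N, with N the number of reference words."""
    return (subs + dels + ins) / n_ref

# The example above: 2 substitutions, 1 deletion, 1 insertion,
# and 4 words actually spoken.
wer = word_error_rate(2, 1, 1, 4)   # (2 + 1 + 1) / 4 = 1.0, i.e., 100% WER
```

Because insertions are counted in the numerator but not in N, WER can exceed 100%, which sometimes surprises newcomers to ASR evaluation.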