Aurora Evaluations
Overview


On this web page we present a summary of results for the training and testing conditions being evaluated in the Aurora competition. Detailed explanations of the baseline system and the corresponding evaluation conditions are available online. In these evaluations, we test the following conditions using the DARPA Wall Street Journal (WSJ) Corpus:
  • Additive Noise: six noise conditions, collected from street traffic, train stations, cars, babble, restaurants, and airports, will be digitally added to the speech data to simulate degradations in the signal-to-noise ratio of the channel (see the signal-preparation sketch after this list).

  • Sample Frequency Reduction: the reduction in accuracy due to decreasing the sample frequency from 16 kHz to 8 kHz will be calibrated.

  • Microphone Variation: performance for both microphone conditions contained in the WSJ0 corpus will be analyzed.

  • Compression: degradations due to data compression of the feature vectors will be evaluated.

  • Model Mismatch: degradations due to mismatch between training and evaluation conditions will be measured.
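
The sketch referenced above illustrates the first two conditions: it mixes a noise recording into a clean utterance at a prescribed SNR and then downsamples the result from 16 kHz to 8 kHz. This is a minimal example assuming NumPy and SciPy; the exact scaling and anti-aliasing filtering used in the actual Aurora data preparation may differ.

    # Sketch: add noise at a prescribed SNR, then reduce the sample
    # frequency from 16 kHz to 8 kHz. Illustrative only.
    import numpy as np
    from scipy.signal import resample_poly

    def mix_at_snr(speech, noise, snr_db):
        """Scale noise so that 10*log10(Ps/Pn) equals snr_db, then add it."""
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)[:len(speech)]   # match utterance length
        p_speech = np.mean(speech ** 2)
        p_noise = np.mean(noise ** 2)
        target_p_noise = p_speech / (10.0 ** (snr_db / 10.0))
        return speech + noise * np.sqrt(target_p_noise / p_noise)

    def downsample_16k_to_8k(x):
        # Polyphase resampling applies the required anti-aliasing filter.
        return resample_poly(x, up=1, down=2)
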
The baseline system for the Aurora evaluations is based on a public-domain speech recognition system that has been under development at the Institute for Signal and Information Processing (ISIP) at Mississippi State University for several years. The system is implemented entirely in C++ and is modular and easy to modify. It has been used in several evaluations conducted by NIST and the Naval Research Laboratory. We use hidden Markov model (HMM) based context-dependent acoustic models, lexical trees for cross-word acoustic modeling, N-gram language models with backoff probabilities for language modeling (finite state networks are also supported), and a tree-based lexicon for pronunciation modeling. We use 16 Gaussian mixture components per state. In the table below, we compare the baseline system to several published state-of-the-art systems.
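
For readers unfamiliar with backoff language modeling, the mechanism reduces to a simple lookup: use the bigram probability when the bigram was observed in training, and otherwise fall back to a scaled unigram estimate. A minimal sketch with hypothetical placeholder values (not the baseline's actual model format):

    import math

    # Hypothetical log10 probabilities and backoff weights.
    log_p_bigram = {("the", "dog"): math.log10(0.02)}
    log_p_unigram = {"dog": math.log10(0.001), "cat": math.log10(0.0008)}
    log_backoff = {"the": math.log10(0.4)}           # backoff weight b(w1)

    def bigram_logprob(w1, w2):
        if (w1, w2) in log_p_bigram:
            return log_p_bigram[(w1, w2)]            # observed bigram
        # Backoff estimate: P(w2 | w1) = b(w1) * P(w2)
        return log_backoff.get(w1, 0.0) + log_p_unigram[w2]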

Performance Summary for WSJ (SI-84 / Eval'92)

Site | Acoustic Model Type                 | Language Model | Adaptation | WER
CU   | word-internal / gender-independent  | bigram         | none       | 8.1%
UT   | word-internal / gender-dependent    | bigram         | none       | 7.1%
ISIP | cross-word / gender-independent     | bigram         | none       | 8.2%
CU   | cross-word / gender-independent     | bigram         | none       | 6.9%
LT   | cross-word / gender-independent     | bigram         | none       | 6.8%
CU   | cross-word / gender-dependent       | bigram         | none       | 6.6%
UT   | cross-word / gender-dependent       | bigram         | none       | 6.4%
UT   | cross-word / gender-dependent       | bigram         | VTLN       | 6.2%
LT   | cross-word / gender-independent     | trigram        | none       | 5.0%
LT   | cross-word / gender-dependent       | trigram        | none       | 4.8%
LT   | cross-word / gender-dependent / tag | trigram        | none       | 4.4%

  • CU: Cambridge University, UK
  • UT: University of Technology, Germany
  • LT: Lucent Technologies
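
All results above and below are word error rates (WER): the substitutions, deletions, and insertions from a minimum-edit-distance alignment of the hypothesis against the reference, divided by the number of reference words. A minimal sketch of that computation (the standard dynamic-programming recurrence, not NIST's scoring tool):

    def wer(ref, hyp):
        """Word error rate via edit distance; inputs are whitespace-split."""
        ref, hyp = ref.split(), hyp.split()
        # d[i][j] = minimum edits turning ref[:i] into hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i                              # deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j                              # insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[-1][-1] / len(ref)

    print(wer("the dog sat", "the dog sat down"))    # one insertion: 0.333...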


The table below contains an evaluation of our 4-mixture cross-word baseline system across a wide range of noise conditions using the ETSI-standard front end. The audio data was not compressed in these experiments. Noise was digitally added to the original WSJ audio data at the prescribed SNR values.

Performance Summary (Without Compression)
Columns 1-14 give the WER for each of the 14 testing sets.

Training Set | Sampling Frequency | Utterance Detection | 1     | 2     | 3     | 4     | 5     | 6     | 7     | 8     | 9     | 10    | 11    | 12    | 13    | 14
1            | 16 kHz             | No                  | 14.9% | 65.2% | 69.2% | 63.1% | 72.3% | 69.4% | 73.2% | 61.3% | 81.7% | 82.5% | 75.4% | 83.8% | 81.0% | 84.1%
1            | 16 kHz             | Yes                 | 14.0% | 56.6% | 57.2% | 54.3% | 60.0% | 55.7% | 62.9% | 52.7% | 74.3% | 74.3% | 67.5% | 75.6% | 71.9% | 74.7%
1            | 8 kHz              | Yes                 | 16.2% | 49.6% | 62.2% | 58.7% | 58.2% | 61.5% | 61.7% | 37.4% | 59.7% | 69.8% | 67.7% | 72.2% | 68.3% | 67.9%
2            | 16 kHz             | No                  | 23.5% | 21.9% | 29.2% | 34.9% | 33.7% | 33.0% | 35.3% | 49.3% | 45.2% | 49.2% | 48.8% | 51.7% | 49.9% | 49.0%
2            | 16 kHz             | Yes                 | 19.2% | 22.4% | 28.5% | 34.0% | 34.0% | 30.0% | 33.9% | 45.0% | 43.9% | 47.2% | 46.3% | 51.2% | 46.6% | 50.0%
2            | 8 kHz              | Yes                 | 18.4% | 24.9% | 37.6% | 39.3% | 38.8% | 38.2% | 40.4% | 29.7% | 37.3% | 48.3% | 46.1% | 50.6% | 44.9% | 49.3%
3            | 16 kHz             | No                  | 20.6% | 23.2% | 34.4% | 40.1% | 38.2% | 34.7% | 41.3% | 46.8% | 49.1% | 53.5% | 53.4% | 57.2% | 53.2% | 56.1%
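
For reference, the per-state acoustic score in a system like the 4-mixture baseline above is the log-likelihood of a feature vector under a Gaussian mixture. A minimal sketch assuming diagonal covariances and 39-dimensional features (both assumptions; the baseline's exact parameterization is not specified here):

    import numpy as np

    def gmm_loglike(x, weights, means, variances):
        """Log-likelihood of x under a diagonal-covariance Gaussian mixture."""
        # Per-component log N(x; mu_m, diag(var_m))
        log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
        log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
        comp = np.log(weights) + log_norm + log_exp
        m = comp.max()                      # stable log-sum-exp
        return m + np.log(np.sum(np.exp(comp - m)))

    M, D = 4, 39                            # 4 mixtures, 39-dim features
    rng = np.random.default_rng(0)
    w = np.full(M, 1.0 / M)                 # placeholder mixture weights
    mu = rng.normal(size=(M, D))            # placeholder means
    var = np.ones((M, D))                   # placeholder variances
    print(gmm_loglike(rng.normal(size=D), w, mu, var))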

The table below contains the same evaluation described in the previous table, but in this case the audio was processed through a standard compression algorithm that simulates the cellular phone environment. More details on the compression algorithm will be provided as these experiments progress.

Performance Summary (With Compression)
Columns 1-14 give the WER for each of the 14 testing sets.

Training Set | Sampling Frequency | Utterance Detection | 1     | 2     | 3     | 4     | 5     | 6     | 7     | 8     | 9     | 10    | 11    | 12    | 13    | 14
1            | 16 kHz             | Yes                 | 14.5% | 58.4% | 58.8% | 53.8% | 62.5% | 56.9% | 65.5% | 53.3% | 75.1% | 76.3% | 68.5% | 77.8% | 73.5% | 75.9%
1            | 8 kHz              | Yes                 | 15.4% | 49.4% | 60.6% | 59.0% | 57.4% | 61.9% | 62.0% | 36.6% | 59.9% | 71.6% | 67.8% | 72.5% | 70.2% | 69.5%
2            | 16 kHz             | Yes                 | 19.1% | 23.4% | 31.7% | 35.5% | 35.3% | 33.1% | 36.4% | 40.9% | 47.4% | 50.3% | 48.9% | 54.7% | 49.3% | 51.8%
2            | 8 kHz              | Yes                 | 20.7% | 26.4% | 38.6% | 41.6% | 43.8% | 41.1% | 43.4% | 30.9% | 38.7% | 47.1% | 50.1% | 53.6% | 47.3% | 50.7%
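
The codec used in the compression experiments is not detailed on this page, but the kind of lossy encode/decode cycle involved can be illustrated with a simpler telephony example: mu-law companding as in G.711, which quantizes each sample to 8 bits. This is only an analogy, not the codec actually used here:

    import numpy as np

    MU = 255.0

    def mulaw_encode(x):
        """Map samples in [-1, 1] to 8-bit codes (0..255)."""
        y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
        return np.round((y + 1) / 2 * MU).astype(np.uint8)

    def mulaw_decode(code):
        """Invert the companding; the quantization error is not recoverable."""
        y = code.astype(np.float64) / MU * 2 - 1
        return np.sign(y) * (np.power(1 + MU, np.abs(y)) - 1) / MU

    x = np.linspace(-1.0, 1.0, 5)
    print(mulaw_decode(mulaw_encode(x)) - x)   # residual quantization error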
