Aurora Evaluations
Overview


On this web page we present a summary of results for the training and testing conditions being evaluated in the Aurora competition. Detailed explanations of the baseline system and the corresponding evaluation conditions are available online. In these evaluations, we test the following conditions using the DARPA Wall Street Journal (WSJ) Corpus:
  • Additive Noise: six noise conditions, collected from street traffic, train stations, cars, babble, restaurants, and airports, will be digitally added to the speech data to simulate degradations in the signal-to-noise ratio of the channel (see the signal-preparation sketch after this list).

  • Sample Frequency Reduction: the reduction in accuracy due to decreasing the sample frequency from 16 kHz to 8 kHz will be calibrated.

  • Microphone Variation: performance for both microphone conditions contained in the WSJ0 corpus will be analyzed.

  • Compression: degradations due to data compression of the feature vectors will be evaluated.

  • Model Mismatch: degradations due to mismatch between training and evaluation conditions will be measured.
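
The sketch referenced above illustrates the first two conditions: it mixes a noise recording into a clean utterance at a prescribed SNR and then downsamples the result from 16 kHz to 8 kHz. This is a minimal example assuming NumPy and SciPy; the exact scaling and anti-aliasing filtering used in the actual Aurora data preparation may differ.

    # Sketch: add noise at a prescribed SNR, then reduce the sample
    # frequency from 16 kHz to 8 kHz. Illustrative only.
    import numpy as np
    from scipy.signal import resample_poly

    def mix_at_snr(speech, noise, snr_db):
        """Scale noise so that 10*log10(Ps/Pn) equals snr_db, then add it."""
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)[:len(speech)]   # match utterance length
        p_speech = np.mean(speech ** 2)
        p_noise = np.mean(noise ** 2)
        target_p_noise = p_speech / (10.0 ** (snr_db / 10.0))
        return speech + noise * np.sqrt(target_p_noise / p_noise)

    def downsample_16k_to_8k(x):
        # Polyphase resampling applies the required anti-aliasing filter.
        return resample_poly(x, up=1, down=2)
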
The baseline system for the Aurora evaluations is based on a public-domain speech recognition system that has been under development at the Institute for Signal and Information Processing (ISIP) at Mississippi State University for several years. The system is implemented entirely in C++ and is modular and easy to modify. It has been used in several evaluations conducted by NIST and the Naval Research Laboratory. We use hidden Markov model (HMM) based context-dependent acoustic models, lexical trees for cross-word acoustic modeling, N-gram language models with backoff probabilities for language modeling (finite state networks are also supported), and a tree-based lexicon for pronunciation modeling. We use 16 Gaussian mixture components per state. In the table below, we compare the baseline system to several published state-of-the-art systems.
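
For readers unfamiliar with backoff language modeling, the mechanism reduces to a simple lookup: use the bigram probability when the bigram was observed in training, and otherwise fall back to a scaled unigram estimate. A minimal sketch with hypothetical placeholder values (not the baseline's actual model format):

    import math

    # Hypothetical log10 probabilities and backoff weights.
    log_p_bigram = {("the", "dog"): math.log10(0.02)}
    log_p_unigram = {"dog": math.log10(0.001), "cat": math.log10(0.0008)}
    log_backoff = {"the": math.log10(0.4)}           # backoff weight b(w1)

    def bigram_logprob(w1, w2):
        if (w1, w2) in log_p_bigram:
            return log_p_bigram[(w1, w2)]            # observed bigram
        # Backoff estimate: P(w2 | w1) = b(w1) * P(w2)
        return log_backoff.get(w1, 0.0) + log_p_unigram[w2]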

Performance Summary for WSJ (SI-84 / Eval'92)

Site | Acoustic Model Type                 | Language Model | Adaptation | WER
CU   | word-internal / gender-independent  | bigram         | none       | 8.1%
UT   | word-internal / gender-dependent    | bigram         | none       | 7.1%
ISIP | cross-word / gender-independent     | bigram         | none       | 8.2%
CU   | cross-word / gender-independent     | bigram         | none       | 6.9%
LT   | cross-word / gender-independent     | bigram         | none       | 6.8%
CU   | cross-word / gender-dependent       | bigram         | none       | 6.6%
UT   | cross-word / gender-dependent       | bigram         | none       | 6.4%
UT   | cross-word / gender-dependent       | bigram         | VTLN       | 6.2%
LT   | cross-word / gender-independent     | trigram        | none       | 5.0%
LT   | cross-word / gender-dependent       | trigram        | none       | 4.8%
LT   | cross-word / gender-dependent / tag | trigram        | none       | 4.4%

  • CU: Cambridge University, UK
  • UT: University of Technology, Germany
  • LT: Lucent Technologies
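
All results above and below are word error rates (WER): the substitutions, deletions, and insertions from a minimum-edit-distance alignment of the hypothesis against the reference, divided by the number of reference words. A minimal sketch of that computation (the standard dynamic-programming recurrence, not NIST's scoring tool):

    def wer(ref, hyp):
        """Word error rate via edit distance; inputs are whitespace-split."""
        ref, hyp = ref.split(), hyp.split()
        # d[i][j] = minimum edits turning ref[:i] into hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i                              # deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j                              # insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[-1][-1] / len(ref)

    print(wer("the dog sat", "the dog sat down"))    # one insertion: 0.333...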


The table below contains an evaluation of our 4-mixture cross-word baseline system across a wide range of noise conditions using the ETSI-standard front end. The audio data was not compressed in these experiments. Noise was digitally added to the original WSJ audio data at the prescribed SNR values.

Performance Summary (Without Compression)
Columns 1-14 give the WER for each of the 14 testing sets.

Training Set | Sampling Frequency | Utterance Detection | 1     | 2     | 3     | 4     | 5     | 6     | 7     | 8     | 9     | 10    | 11    | 12    | 13    | 14
1            | 16 kHz             | No                  | 14.9% | 65.2% | 69.2% | 63.1% | 72.3% | 69.4% | 73.2% | 61.3% | 81.7% | 82.5% | 75.4% | 83.8% | 81.0% | 84.1%
1            | 16 kHz             | Yes                 | 14.0% | 56.6% | 57.2% | 54.3% | 60.0% | 55.7% | 62.9% | 52.7% | 74.3% | 74.3% | 67.5% | 75.6% | 71.9% | 74.7%
1            | 8 kHz              | Yes                 | 16.2% | 49.6% | 62.2% | 58.7% | 58.2% | 61.5% | 61.7% | 37.4% | 59.7% | 69.8% | 67.7% | 72.2% | 68.3% | 67.9%
2            | 16 kHz             | No                  | 23.5% | 21.9% | 29.2% | 34.9% | 33.7% | 33.0% | 35.3% | 49.3% | 45.2% | 49.2% | 48.8% | 51.7% | 49.9% | 49.0%
2            | 16 kHz             | Yes                 | 19.2% | 22.4% | 28.5% | 34.0% | 34.0% | 30.0% | 33.9% | 45.0% | 43.9% | 47.2% | 46.3% | 51.2% | 46.6% | 50.0%
2            | 8 kHz              | Yes                 | 18.4% | 24.9% | 37.6% | 39.3% | 38.8% | 38.2% | 40.4% | 29.7% | 37.3% | 48.3% | 46.1% | 50.6% | 44.9% | 49.3%
3            | 16 kHz             | No                  | 20.6% | 23.2% | 34.4% | 40.1% | 38.2% | 34.7% | 41.3% | 46.8% | 49.1% | 53.5% | 53.4% | 57.2% | 53.2% | 56.1%
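
For reference, the per-state acoustic score in a system like the 4-mixture baseline above is the log-likelihood of a feature vector under a Gaussian mixture. A minimal sketch assuming diagonal covariances and 39-dimensional features (both assumptions; the baseline's exact parameterization is not specified here):

    import numpy as np

    def gmm_loglike(x, weights, means, variances):
        """Log-likelihood of x under a diagonal-covariance Gaussian mixture."""
        # Per-component log N(x; mu_m, diag(var_m))
        log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
        log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
        comp = np.log(weights) + log_norm + log_exp
        m = comp.max()                      # stable log-sum-exp
        return m + np.log(np.sum(np.exp(comp - m)))

    M, D = 4, 39                            # 4 mixtures, 39-dim features
    rng = np.random.default_rng(0)
    w = np.full(M, 1.0 / M)                 # placeholder mixture weights
    mu = rng.normal(size=(M, D))            # placeholder means
    var = np.ones((M, D))                   # placeholder variances
    print(gmm_loglike(rng.normal(size=D), w, mu, var))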

The table below contains the same evaluation described in the previous table, but in this case the audio was processed through a standard compression algorithm that simulates the cellular phone environment. More details on the compression algorithm will be provided as these experiments progress.

Performance Summary (With Compression)
Columns 1-14 give the WER for each of the 14 testing sets.

Training Set | Sampling Frequency | Utterance Detection | 1     | 2     | 3     | 4     | 5     | 6     | 7     | 8     | 9     | 10    | 11    | 12    | 13    | 14
1            | 16 kHz             | Yes                 | 14.5% | 58.4% | 58.8% | 53.8% | 62.5% | 56.9% | 65.5% | 53.3% | 75.1% | 76.3% | 68.5% | 77.8% | 73.5% | 75.9%
1            | 8 kHz              | Yes                 | 15.4% | 49.4% | 60.6% | 59.0% | 57.4% | 61.9% | 62.0% | 36.6% | 59.9% | 71.6% | 67.8% | 72.5% | 70.2% | 69.5%
2            | 16 kHz             | Yes                 | 19.1% | 23.4% | 31.7% | 35.5% | 35.3% | 33.1% | 36.4% | 40.9% | 47.4% | 50.3% | 48.9% | 54.7% | 49.3% | 51.8%
2            | 8 kHz              | Yes                 | 20.7% | 26.4% | 38.6% | 41.6% | 43.8% | 41.1% | 43.4% | 30.9% | 38.7% | 47.1% | 50.1% | 53.6% | 47.3% | 50.7%
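
The codec used in the compression experiments is not detailed on this page, but the kind of lossy encode/decode cycle involved can be illustrated with a simpler telephony example: mu-law companding as in G.711, which quantizes each sample to 8 bits. This is only an analogy, not the codec actually used here:

    import numpy as np

    MU = 255.0

    def mulaw_encode(x):
        """Map samples in [-1, 1] to 8-bit codes (0..255)."""
        y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
        return np.round((y + 1) / 2 * MU).astype(np.uint8)

    def mulaw_decode(code):
        """Invert the companding; the quantization error is not recoverable."""
        y = code.astype(np.float64) / MU * 2 - 1
        return np.sign(y) * (np.power(1 + MU, np.abs(y)) - 1) / MU

    x = np.linspace(-1.0, 1.0, 5)
    print(mulaw_decode(mulaw_encode(x)) - x)   # residual quantization error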
