IMPROVING RECOGNITION PERFORMANCE IN NOISY ENVIRONMENTS

Recently, there has been renewed interest in the problem of speech recognition in noisy environments. Three applications fueling this interest are cellular telephony, voice interfaces for automotive applications, and hands-free interfaces for portable computing devices. All share, to some extent, a common problem: systems must be robust to impulsive, non-stationary noise signals, many of which have not been observed in the training data. Further, training algorithms are notoriously prone to overfitting models to the dominant modes in the data and have trouble capturing infrequently occurring or inconsistent data such as noise (EM moves means and variances too quickly toward speech). Similarly, well-known adaptation techniques such as Maximum Likelihood Linear Regression (MLLR) are often not appropriate, since they require relatively large amounts of adaptation data or introduce unacceptable latency in the system.

There have historically been two approaches to this problem that are relevant to the workshop: enhancing the acoustic front end and developing noise-enhanced acoustic models. The former approach is perhaps the most popular and typically includes front ends with built-in noise adaptation (e.g., estimating the background channel on the fly) and nonlinear spectral processing (e.g., spectral peak clipping); a minimal sketch of such a front end appears at the end of this section. The latter approach attempts to integrate some form of noise estimation into the state-based statistical model, either via a subspace estimator or through a separate statistical model for the noise. (Beyond these approaches, there is renewed interest in transducer design and adaptive filtering, but we would recommend not pursuing these directions.)

The recently completed Aurora evaluation, part of the ETSI standards activity to develop a standard for feature extraction in client/server applications for cellular telephony, would provide a suitable framework in which to conduct this research. The evaluation task is based on the WSJ 5,000-word closed-vocabulary task and includes clean speech, telephone-bandwidth speech, and speech degraded by digitally added noise. Performance in the clean condition (matched training and testing) is on the order of 7% WER; performance in the noise conditions, even after training on noisy data, is on the order of 30% WER. See http://www.isip.msstate.edu/projects/aurora/performance/index.html for more details on the baseline performance. Two noise-adaptive front ends recently developed by a consortium of sites interested in this problem reduced error rates by 50%, but performance was still far from that achieved in clean conditions.

We believe a variety of approaches could be pursued. Three that we are interested in are:

- linguistically-motivated noise reduction: can we use higher-level linguistic knowledge to better distinguish speech from noise within the recognition loop;

- state-level statistical models: can we learn noise distributions at the state level, or exploit knowledge about phonetic context to improve discrimination at the state or phone level;

- machine learning of noise distributions: can we iteratively learn noise distributions at the state level much as we learn signal distributions; alternatively, can we estimate the parameters of typical subspace rotations using EM-based techniques rather than restricting such methods to front-end processing outside the recognizer (a toy sketch of this idea follows the list).
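The last bullet asks whether noise distributions can be learned with the same EM machinery used for the signal distributions. The following Python sketch is a purely illustrative toy, not the proposed state-level estimator: the synthetic data, initial values, and component labels are all assumptions. It fits a nominal "speech" Gaussian and a nominal "noise" Gaussian to 1-D frame features with standard EM; the point is simply that the noise component is re-estimated by exactly the same update equations as the speech component.

    # Toy sketch: re-estimate a "noise" Gaussian alongside a "speech" Gaussian
    # with the same EM updates used for signal distributions. Synthetic data
    # and all constants are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic 1-D log-energy-like features for one state: mostly speech
    # frames, with a minority of noise frames (the infrequent mode that EM
    # tends to underfit).
    speech = rng.normal(loc=10.0, scale=1.0, size=900)
    noise = rng.normal(loc=3.0, scale=2.0, size=100)
    x = np.concatenate([speech, noise])

    # Initial guesses: one component nominally "speech", one nominally "noise".
    means = np.array([8.0, 4.0])
    variances = np.array([4.0, 4.0])
    weights = np.array([0.5, 0.5])

    def gauss(x, mean, var):
        """Univariate Gaussian density."""
        return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

    for iteration in range(20):
        # E-step: posterior probability of each component for every frame.
        lik = np.stack([w * gauss(x, m, v)
                        for w, m, v in zip(weights, means, variances)])
        post = lik / lik.sum(axis=0, keepdims=True)

        # M-step: standard ML re-estimation of weights, means, and variances,
        # applied identically to the speech and the noise component.
        counts = post.sum(axis=1)
        weights = counts / counts.sum()
        means = (post * x).sum(axis=1) / counts
        variances = (post * (x - means[:, None]) ** 2).sum(axis=1) / counts

    print("weights  :", np.round(weights, 3))
    print("means    :", np.round(means, 3))
    print("variances:", np.round(variances, 3))

In a recognizer the same re-estimation would be driven by state-level posteriors from the forward-backward pass rather than a stand-alone mixture fit; the sketch only shows the shape of the update.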
There has also been considerable recent work on noise adaptation methods and multistream processing that may be relevant to this problem.
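To make the front-end approach mentioned above concrete (estimating the background channel on the fly, followed by a nonlinear spectral operation), here is a minimal spectral-subtraction-style sketch. It is an assumption-laden toy, not the Aurora front end or any consortium design: the frame layout, smoothing constant, energy threshold, and spectral floor are all hypothetical choices.

    # Toy sketch of a noise-adaptive front end: track a running estimate of the
    # background spectrum and subtract it from each frame, with a spectral
    # floor. All constants are illustrative assumptions.
    import numpy as np

    def noise_adaptive_magnitudes(frames, alpha=0.98, floor=0.01):
        """frames: (num_frames, frame_len) array of time-domain samples.
        Returns noise-subtracted magnitude spectra, one row per frame."""
        noise_est = None
        out = []
        for frame in frames:
            mag = np.abs(np.fft.rfft(frame))
            if noise_est is None:
                noise_est = mag.copy()            # bootstrap from the first frame
            # Update the background estimate only on low-energy (likely
            # non-speech) frames: the "estimate the channel on the fly" idea.
            if mag.sum() < 1.5 * noise_est.sum():
                noise_est = alpha * noise_est + (1.0 - alpha) * mag
            # Subtract the estimate and clip to a small positive floor
            # (a simple form of nonlinear spectral processing).
            out.append(np.maximum(mag - noise_est, floor * mag))
        return np.array(out)

    # Tiny usage example on synthetic data: background noise plus an
    # intermittent tone standing in for speech.
    rng = np.random.default_rng(1)
    t = np.arange(200 * 160).reshape(200, 160) / 8000.0
    signal = 0.1 * rng.standard_normal((200, 160))
    signal[50:150] += np.sin(2 * np.pi * 440.0 * t[50:150])
    features = noise_adaptive_magnitudes(signal)
    print(features.shape)

A real front end would of course apply windowing, mel filtering, and cepstral processing on top of this; the sketch isolates only the on-the-fly noise estimate and the nonlinear subtraction step.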