In a typical state-of-the-art large vocabulary conversational
speech recognition (LVCSR) system, a single speech model is
developed using data from a large number of speakers to cover the
variance across dialect, speaking styles, etc. Since the speech
model is the average of all the speakers, the speech recognition
results should be the average of all speakers. Such system is
called a speaker independent system. The drawback of such a system
is that its performance is not optimal for any particular
speaker. To make the recognition system perform optimal for a
particular speaker, the best paradigm is to construct a system
using all the data only collected from this speaker. Such a system
is called speaker dependent system. The speaker dependent system
usually performs badly for other speakers. At the same time
collecting a lot of data from one speaker is a very difficult
task.
The obvious solution is that we can use the speaker independent
system, by collecting a small amount of data from a new
speaker. The system can then be adapted to fit specific feature of
the new speaker. The new system, thus created, will give better
performance for this speaker. The performance of the new system
will lie between the speaker independent system and speaker
dependent system. The more adaptation data, the more closer to
speaker dependent system.
The Maximum Likelihood Linear Regression (MLLR) can be used to
perform such an adaptation, and it will be released with version
r00_n12 of our
production system.
You can monitor the progress of this release using our
asr mailing
list.
This tutorial provides steps on how to run our production system
using MLLR adaptation. The theory behind this implementation can
be found in the dissertation:
- Chris J. Leggetter, Improved Acoustic Modeling for HMMs
Using Linear Transformations, PhD thesis, Department of
Engineering, University of Cambridge. February 1995,
Commonly, it is assumed that the primary difference between
speakers is in the average position of phones in the acoustic
space. In other words, the mean adaptation gives profound
performance. Currently, we have only implemented the mean
adaptation and in this tutorial, we mainly discuss the mean
adaptation.
By using some given adaptation data, MLLR can build a single
global transform to adapt all models. We can use following
equation to get a new estimate of the adapted mean for a model:
where n is the dimensionality of the data, W is the n x (n+1)
transformation matrix and
is the extended mean vector.
is the offset indicator, usually set to 1.0. Estimating the
transformation matrix (W) is the core of the MLLR adaptation.
As more data becomes available, we can do better by classifying
the models into different classes and getting fine-grained
transformations for each class. A regress class tree plays a
critical role to manage those processes. According to the amount,
and type of adaptation data available, the set of transformations
can be chosen through the regression class tree.
In this tutorial, we will cover the process of MLLR adaptation of
a speaker and the command line interfaces for MLLR adaptation.
- The process of MLLR adaptation for a speaker:
Using an existing model to conduct MLLR adaptation
involves four basic steps according to users'
specification. Those are: regression tree generation,
adaptation accumulation, transformation creation, and
adaptation of models.
- Regression Tree Generation:
The first step in MLLR adaptation is to create
regression decision tree. The regression decision tree
is constructed in such a way that the Gaussian
components, which are close in acoustic space, are put
in the same regression class and can be transformed in
a similar way. The input for this step is the
statistical acoustic models of the system. The output
for this step is a regression decision tree.
- Adaptation Data Accumulation:
Next, the adaptation data is accumulated. This step
is the same as the general training process of the
system. The input for this step is the model and
speech data of a specific speaker, and output is
models including the adaptation data.
- Transformation Creation:
Then, the regression decision tree and models which
accumulated the adaptation data are used to create
transformation matrix for each regression class,
which actually is the node of the regression decision
tree. The input for this step is the regression tree
built in the step i. and models included the
adaptation information from step ii. The output for
this step is the regression tree which includes the
transformation matrix for each node.
- Adaptation:
Finally, each component of a model is adapted by a
specific transform matrix which belongs to a
particular corresponding regression class. The input
for this step is the models and regression decision
tree, and the output is the adapted models.
- The command line for MLLR adaptation:
All four steps mentioned above can be processed in one
command line. The command line for MLLR adaptation is the
same as other cases of using isip_recognize. The only
difference is in the parameter file. Users need to specify
the options for MLLR adaptation in the parameter
file.
isip_recognize.exe -parameter_file
params/params_1.sof
-list lists/identifiers.sof -verbose brief
Finally, one more note: MLLR decoding is the same as
standard decoding, but you need to pay attention to using
models that are adapted (output of the adaptation process)
for each specific speaker.
In this tutorial, we gave a brief introduction for the process
for MLLR adaptation and command line interfaces. The MLLR
adaptation system usually gets much better performance than a
speaker independent system.