4.4.1 Forced Alignment: Overview
As we've seen thus far, a speech recognition system uses a search engine together with an acoustic model and a language model, which define a set of possible words, phonemes, or other units, to match speech data to the correct spoken utterance. The search engine processes the features extracted from the speech data to identify occurrences of the words, phonemes, or whatever units it is equipped to search for, and returns the results.
Forced alignment is similar to this process, but it differs in one major respect. Rather than being given a set of possible words to search for, the search engine is given an exact transcription of what is spoken in the speech data. Because the word sequence is already known, only the timing remains to be found: the system aligns the transcription with the speech data, identifying which time segments in the speech data correspond to which words in the transcription.
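The idea can be illustrated with a small dynamic-programming sketch. It is a toy, not the algorithm of any particular recognizer: it assumes we already have a table of per-frame scores (log-likelihoods, higher is better) for each token in the transcription, and it only looks for the best monotonic split of the frames among the tokens, in order. The function name force_align and the score values are made up for illustration; a real system would obtain its scores from the acoustic model.

```python
def force_align(scores, tokens):
    """Align T frames to K tokens that must occur in the given order.

    scores[t][k] is an assumed log-likelihood of frame t belonging to
    tokens[k]. Every token covers at least one frame.
    Returns a list of (token, start_frame, end_frame), end exclusive.
    """
    T, K = len(scores), len(tokens)
    if K == 0 or T < K:
        raise ValueError("need at least one frame per token")

    NEG = float("-inf")
    # best[t][k]: best total score of frames 0..t with frame t in token k
    best = [[NEG] * K for _ in range(T)]
    # advanced[t][k]: True if frame t is the first frame of token k
    advanced = [[False] * K for _ in range(T)]
    best[0][0] = scores[0][0]

    for t in range(1, T):
        for k in range(min(K, t + 1)):
            stay = best[t - 1][k]                       # remain in same token
            move = best[t - 1][k - 1] if k > 0 else NEG  # enter next token
            if move > stay:
                best[t][k] = move + scores[t][k]
                advanced[t][k] = True
            else:
                best[t][k] = stay + scores[t][k]

    # Trace back from the last frame, which must belong to the last token.
    labels = [0] * T
    k = K - 1
    for t in range(T - 1, -1, -1):
        labels[t] = k
        if advanced[t][k]:
            k -= 1

    # Collapse the per-frame labels into one segment per token.
    segments = []
    start = 0
    for t in range(1, T + 1):
        if t == T or labels[t] != labels[start]:
            segments.append((tokens[labels[start]], start, t))
            start = t
    return segments


# Hypothetical scores for 5 frames and the transcription "hello world".
example_scores = [
    [-1.0, -5.0],
    [-1.0, -5.0],
    [-2.0, -2.5],
    [-5.0, -1.0],
    [-5.0, -1.0],
]
for token, start, end in force_align(example_scores, ["hello", "world"]):
    print(token, start, end)
```

The result is expressed in frame indices; multiplying by the frame step (often around 10 ms, though that depends on the front end) would convert it to time. Note that the search never considers any word outside the transcription or any other word order, which is what distinguishes this from ordinary recognition.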
Forced alignment can also be used to align the phonemes of the transcription with the speech data, similar to the image below, although with more explicitly defined boundaries marking where each phoneme begins and ends.