4.3.6 Scoring: Significance Testing
Although hypothesis scoring gives us a good idea of how well a recognition system performs on a set of data, it is not the best way to compare two recognition systems and determine which one is better. For this task, significance testing is often used. Instead of looking at an entire utterance transcription at one time, significance testing usually splits the transcriptions into segments consisting of several words. The segments are specific to the pair of systems being compared. They are bounded on both sides by words correctly recognized by both systems (or by the beginning or end of the utterance). See the figure below:

The significance test involves the difference in the numbers of errors of the two systems in each segment. The mean of these differences is used along with a control parameter called the "significance level" to determine whether one recognition system is significantly better than the other. For a more technical definition of this test, see this report.

Now that you have a basic understanding of significance testing, let's run through a simple example. This example will use the results from the experiments in Section 4.2.4, word-internal models, and Section 4.2.5, cross-word models. Go to the following directory:
./isip_eval_sgml.sh> evaluating using sclite .....
/usr/local/sctk/bin/sclite -F -i swb -r reference.score -h results_01.score.score -o sgml
sclite: 2.2 TK Version 1.2
Begin alignment of Ref File: 'reference.score' and Hyp File: 'results_01.score'
Alignment# 18 for speaker ah
Alignment# 17 for speaker ar
Alignment# 17 for speaker at
Alignment# 17 for speaker bc
Alignment# 17 for speaker be
Alignment# 17 for speaker bm
Alignment# 17 for speaker bn
....
Run the command:
sc_stats: 1.2
Beginning Multi-System comparisons and reports
Performing the Matched Pair Sentence Segment (Word Error) Test
Output written to 'result_sys_01_sys_02.stats.mapsswe'
Printing Unified Statistical Test Reports
Output written to 'result_sys_01_sys_02.stats.unified'
Successful Completion

This command uses NIST's sc_stats tool to perform a two-tailed significance test with the null hypothesis that there is no performance difference between the two systems. Two files are generated: result_sys_01_sys_02.stats.mapsswe and result_sys_01_sys_02.stats.unified. The file ending with .unified contains the report. The other file is empty and we will ignore it. The report consists of a detailed explanation of how to read the significance findings between the two systems. Click here to see an example of this report.
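To make the idea behind the test more concrete, here is a minimal Python sketch of a two-tailed matched-pair test on per-segment error counts. It is an illustration only, not a reimplementation of sc_stats. It assumes the segments have already been found and that each system's error count per segment is available. It also assumes a normal approximation for the test statistic, which is only reasonable when there are many segments. The function name, the error counts, and the 0.05 significance level are all hypothetical.

```python
import math


def matched_pair_test(errors_a, errors_b, alpha=0.05):
    """Two-tailed matched-pair test on per-segment error counts.

    errors_a[i] and errors_b[i] are the numbers of errors made by
    system A and system B in segment i (hypothetical inputs).
    Returns (mean difference, test statistic, p-value, significant?).
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("both systems must be scored on the same segments")
    n = len(errors_a)
    if n < 2:
        raise ValueError("need at least two segments")

    # Per-segment difference in the number of errors.
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    mean = sum(diffs) / n

    # Unbiased sample variance of the differences.
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0:
        raise ValueError("all differences are identical; statistic undefined")

    # Under the null hypothesis (no performance difference) the mean
    # difference is zero, so this ratio is approximately standard normal.
    w = mean / math.sqrt(var / n)

    # Two-tailed p-value for a standard normal: P(|Z| >= |w|).
    p_value = math.erfc(abs(w) / math.sqrt(2))
    return mean, w, p_value, p_value < alpha


# Hypothetical error counts for eight segments (illustration only;
# far too few segments for the normal approximation in practice).
sys_01 = [2, 1, 3, 0, 2, 1, 2, 3]
sys_02 = [1, 0, 1, 0, 1, 1, 0, 2]

mean, w, p_value, significant = matched_pair_test(sys_01, sys_02)
print(f"mean difference = {mean:.3f}, W = {w:.3f}, p = {p_value:.5f}")
print("significant at the 0.05 level" if significant else "not significant")
```

A positive mean difference here means the first system made more errors per segment than the second. The null hypothesis is rejected when the p-value falls below the chosen significance level.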