4.3.6 Scoring: Significance Testing

Although hypothesis scoring gives us a good idea of how well a recognition system performs on a set of data, it is not the best way to compare two different recognition systems and determine which one is better. For this task, significance testing is often used.

Instead of looking at an entire utterance transcription at one time, significance testing usually splits the transcriptions into segments consisting of several words. The segments are specific to the pair of systems being compared. They are bounded on both sides by words correctly recognized by both systems (or by the beginning or end of utterance). See the figure below:

The significance test is based on the difference between the numbers of errors the two systems make in each segment. The mean of these differences is used, along with a control parameter called the "significance level", to determine through an experiment whether one recognition system is significantly better than the other. For a more technical definition of this test, see this report.
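To make the idea concrete, here is a minimal Python sketch of the segmentation and the test described above. It is not the sc_stats implementation. It assumes both hypotheses have already been aligned to the reference so that each aligned position carries an error count per system, it uses a single jointly correct word as a segment boundary (as in the description above), and it relies on a normal approximation that is only reasonable when there are many segments. The function names split_segments and matched_pairs_test are hypothetical.

```python
import math


def split_segments(errors_a, errors_b):
    """Group aligned word positions into segments.

    errors_a[i] and errors_b[i] are the numbers of errors each system made
    at aligned position i (0 means the word was recognized correctly).
    A position where both systems are correct acts as a segment boundary.
    Returns a list of (errors_a_in_segment, errors_b_in_segment) pairs.
    """
    segments = []
    cur_a = cur_b = 0
    open_segment = False
    for ea, eb in zip(errors_a, errors_b):
        if ea == 0 and eb == 0:
            # Both systems are correct here: close any open segment.
            if open_segment:
                segments.append((cur_a, cur_b))
                cur_a = cur_b = 0
                open_segment = False
        else:
            cur_a += ea
            cur_b += eb
            open_segment = True
    if open_segment:
        # The end of the utterance also closes a segment.
        segments.append((cur_a, cur_b))
    return segments


def matched_pairs_test(segments, significance_level=0.05):
    """Two-tailed test of the null hypothesis 'mean error difference is 0'.

    Returns (statistic, p_value, is_significant).
    """
    diffs = [a - b for a, b in segments]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two segments")
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if variance == 0:
        raise ValueError("all segment differences are identical")
    statistic = mean / math.sqrt(variance / n)
    # Two-tailed p-value under a standard normal approximation.
    p_value = math.erfc(abs(statistic) / math.sqrt(2))
    return statistic, p_value, p_value < significance_level


# Toy data, far too small for a real test; it only shows the mechanics.
errors_a = [0, 1, 1, 0, 0, 1, 0, 2, 0]
errors_b = [0, 0, 1, 0, 1, 0, 0, 0, 0]
segments = split_segments(errors_a, errors_b)  # [(2, 1), (1, 1), (2, 0)]
statistic, p_value, is_significant = matched_pairs_test(segments)
print(segments, statistic, p_value, is_significant)
```

A smaller significance level makes the test stricter: the difference between the two systems is reported as significant only when the p-value falls below that level.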

Now that you have a basic understanding of significance testing, let's run through a simple example. This example uses the results from the experiments in Section 4.2.4, word-internal models, and Section 4.2.5, cross-word models. Go to the following directory:
    cd $ISIP_TUTORIAL/sections/s04/s04_03_p06/
This directory contains several files including hypotheses generated by the two different experiments, and a script called isip_eval_sgml.sh. The following test will attempt to determine if one system is significantly better than the other. Run the command:
    isip_eval_sgml.sh score $ISIP_TUTORIAL/research/isip/databases/lists/identifiers_test.sof reference.score results_01.score
Expected Output:
    ./isip_eval_sgml.sh> converting from isip_word format to score format .....
    ./isip_eval_sgml.sh> evaluating using sclite .....
    /usr/local/sctk/bin/sclite -F -i swb -r reference.score -h results_01.score.score -o sgml
    sclite: 2.2 TK Version 1.2
    Begin alignment of Ref File: 'reference.score' and Hyp File: 'results_01.score'
    Alignment# 18 for speaker ah
    Alignment# 17 for speaker ar
    Alignment# 17 for speaker at
    Alignment# 17 for speaker bc
    Alignment# 17 for speaker be
    Alignment# 17 for speaker bm
    Alignment# 17 for speaker bn
    ....
This command aligns the hypothesis file to the reference file and splits the utterances into segments of the type described above. Two files are created: results_01.score.report and results_01.score.sgml. The file results_01.score.report is empty, and we will ignore it. The file results_01.score.sgml is an sgml score file; we will use it later, together with the score file of the second system, to compare the two systems. Now that we have the alignments for the results of the first system, we need to extract the alignments for the results of the second system.

Run the command:
    isip_eval_sgml.sh score $ISIP_TUTORIAL/research/isip/databases/lists/identifiers_test.sof reference.score results_02.score
This command generates two more files: results_02.score.report and results_02.score.sgml. Once again, the file results_02.score.sgml is the sgml score file for the second system. We can now use these two sgml score files to compare both systems.

Run the command:
    cat results_01.score.sgml results_02.score.sgml | sc_stats -p -t mapsswe -v -u -n result_sys_01_sys_02
Expected output:
    sc_stats: 1.2
    Beginning Multi-System comparisons and reports
        Performing the Matched Pair Sentence Segment (Word Error) Test
            Output written to 'result_sys_01_sys_02.stats.mapsswe'
        Printing Unified Statistical Test Reports
            Output written to 'result_sys_01_sys_02.stats.unified'

    Successful Completion
This command uses NIST's sc_stats tool to perform a two-tailed significance test with the null hypothesis that there is no performance difference between the two systems. Two files are generated: result_sys_01_sys_02.stats.mapsswe and result_sys_01_sys_02.stats.unified. The file ending with .unified contains the report. The other file is empty, and we will ignore it. The report includes a detailed explanation of how to read the significance findings for the two systems. Click here to see an example of this report.
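If you want to repeat this comparison for another pair of hypothesis files, the three commands can be assembled by a small helper. The Python sketch below only builds the command lines shown in this section; it introduces no new tool options. The names build_commands and run_all are hypothetical, and the sketch assumes, as in the example above, that scoring a hypothesis file writes its sgml score file next to it with .sgml appended to the name.

```python
import os
import shlex
import subprocess


def build_commands(id_list, reference, hyp_a, hyp_b, report_name):
    """Return the three shell commands used in this section, in order."""
    q = shlex.quote
    score = "isip_eval_sgml.sh score {} {} {}"
    return [
        # Align each hypothesis file against the reference.
        score.format(q(id_list), q(reference), q(hyp_a)),
        score.format(q(id_list), q(reference), q(hyp_b)),
        # Feed both sgml score files to sc_stats for the comparison.
        "cat {} {} | sc_stats -p -t mapsswe -v -u -n {}".format(
            q(hyp_a + ".sgml"), q(hyp_b + ".sgml"), q(report_name)),
    ]


def run_all(commands):
    """Run each command in turn, stopping at the first failure."""
    for command in commands:
        print("running:", command)
        subprocess.run(command, shell=True, check=True)


# Falling back to "." when ISIP_TUTORIAL is unset is an assumption made
# only so that this sketch can print the commands anywhere.
tutorial_root = os.environ.get("ISIP_TUTORIAL", ".")
id_list = os.path.join(
    tutorial_root, "research/isip/databases/lists/identifiers_test.sof")

commands = build_commands(
    id_list, "reference.score", "results_01.score", "results_02.score",
    "result_sys_01_sys_02")
for command in commands:
    print(command)
```

The sketch only prints the commands. To execute them, call run_all(commands) from the tutorial directory with isip_eval_sgml.sh and sc_stats available on your path.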
   