
The annotation graph represents the linguistic annotation of recorded speech data. In speech recognition, the linguistic annotation is simply an orthographic annotation of the speech data, which may or may not be time-aligned to an audio recording [1]. The orthographic annotation, generally referred to as a transcription, is a label associated with the audio recording. The transcription, along with the audio recording, is used to train the speech recognition system in a supervised learning framework. The annotated transcription may include a hierarchy of linguistic, syntactic, and semantic knowledge sources that need to be conveniently represented.

The annotation graph provides a convenient means of representing a hierarchy of knowledge sources. An annotation graph may represent a single transcription or an entire conversation, depending on how the speech database is organized. This alleviates the problem of maintaining multiple copies of the same transcription, one for each knowledge source, and it also provides an application programmer interface (API) for tagging and querying the various knowledge sources.

Framework

The design of the annotation graph framework follows the design specification given by S. Bird and M. Liberman at the Linguistic Data Consortium [2]. The following is a contrived example that will be used to demonstrate the essential elements of an annotation graph.

Annotation Graph Example

The annotation graph represents the knowledge sources for the orthographic transcription "the boy ran." The transcription is time-aligned to the audio recording, i.e., the audio recording contains the words in the transcription and has a duration of 1.4 seconds. The syntactic structure of the transcription (noun and verb phrases in this case) is represented as a separate layer in the annotation graph. The noun phrase, "the boy," lies in the interval [0.0, 0.2], and the verb phrase, "ran," lies in the interval [0.8, 1.4]. The phonetic structure, i.e., the phones realized by the words, is also represented as a separate layer in the annotation graph. The word "the," which begins the noun phrase of the transcription, is represented by the phone /dh/ followed by /ax/. The phones corresponding to the words in the transcription can be strung together to form a phonetic transcription.
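The layered structure described above can be sketched as a plain data model in standard C++. This is only an illustrative sketch of how the layers co-exist, not the library's actual representation; the `Interval` struct and `layer` function are hypothetical names introduced here.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One labeled interval on one layer of the annotation graph.
struct Interval {
  std::string level;   // knowledge source: SYNTAX, PHONE, ...
  double start;        // time offset in seconds
  double end;
  std::string label;
};

// The contrived example "the boy ran"; the intervals mirror the
// values given in the text above.
static const std::vector<Interval> example = {
  {"SYNTAX", 0.0, 0.2, "NP"},   // noun phrase "the boy"
  {"SYNTAX", 0.8, 1.4, "VP"},   // verb phrase "ran"
  {"PHONE",  0.0, 0.1, "dh"},   // first phone of "the"
  {"PHONE",  0.1, 0.2, "ax"}    // second phone of "the"
};

// Collect the labels of one layer, preserving their order.
std::vector<std::string> layer(const std::vector<Interval>& graph,
                               const std::string& level) {
  std::vector<std::string> out;
  for (const auto& i : graph)
    if (i.level == level) out.push_back(i.label);
  return out;
}
```

Because each interval carries its own layer tag, stringing the `PHONE` labels together yields the phonetic transcription directly.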

The various layers in the annotation graph, described above, each represent a different knowledge source in the linguistic annotation. The arcs in the annotation graph represent the linguistic notations applied to the raw language data. The nodes in the annotation graph represent the time offsets corresponding to each linguistic notation. The nodes need not contain time offsets; they can be empty to indicate notations that are not time-aligned. This framework allows the various linguistic knowledge sources to co-exist within the same structure. The ability to represent the different knowledge sources, and the flexibility of tagging and querying the notations, are the primary reasons why we chose to incorporate the annotation graph framework in our system.
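The node/arc structure can be sketched in standard C++ as follows. This assumes nothing about the library's internals: nodes carry an optional time offset (empty for notations that are not time-aligned) and arcs carry the notation. All type and function names here are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// A node: an anchor point that may or may not carry a time offset.
struct Node {
  std::optional<double> offset;  // empty => not time-aligned
};

// An arc: a linguistic notation spanning two nodes.
struct Arc {
  std::size_t from;
  std::size_t to;
  std::string notation;
};

struct Graph {
  std::vector<Node> nodes;
  std::vector<Arc> arcs;

  std::size_t addNode(std::optional<double> t = {}) {
    nodes.push_back({t});
    return nodes.size() - 1;
  }
  void addArc(std::size_t from, std::size_t to, const std::string& s) {
    arcs.push_back({from, to, s});
  }
};

// Build the word "the" over [0.0, 0.2], then attach a notation that
// ends at an empty (non-time-aligned) node for illustration.
Graph buildThe() {
  Graph g;
  std::size_t n0 = g.addNode(0.0);
  std::size_t n1 = g.addNode(0.2);
  g.addArc(n0, n1, "the");
  std::size_t n2 = g.addNode();        // empty node: no time offset
  g.addArc(n1, n2, "untimed-notation"); // hypothetical label
  return g;
}
```

The empty node shows how time-aligned and non-time-aligned notations co-exist in the same structure, which is the flexibility the text refers to.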

One way we use annotation graphs is to represent the hypotheses generated by the decoder during recognition. The decoder is hierarchical in nature, i.e., each level of the hierarchy represents a separate knowledge source. A hypothesis generated during recognition can represent each level of the hierarchy as a separate layer in the annotation graph. Therefore, each hypothesis is easily output as an annotation graph.

Example

The following is an example of how to build an annotation graph that contains the orthographic transcription "the" and its corresponding phones /dh/ and /ax/:

#include <AnnotationGraph.h>

int main(int argc, const char** argv) {

  String tmp_str;
  String name(L"CONTRIVED EXAMPLE");
  String type(L"TRANSCRIPTION");
  String unit(L"seconds");

  Float offset_00(0.0);
  Float offset_01(0.1);
  Float offset_02(0.2);

  AnnotationGraph angr(name, type);

  // annotation for the word "the"
  //
  tmp_str.assign(L"the");
  angr.createAnnotation(name,
                  angr.getAnchorById(angr.createAnchor(name, offset_00, unit)),
                  angr.getAnchorById(angr.createAnchor(name, offset_02, unit)),
                  tmp_str);

  // annotation for the phone "dh"
  //
  tmp_str.assign(L"dh");
  angr.createAnnotation(name,
                 angr.getAnchorById(angr.createAnchor(name, offset_00, unit)),
                 angr.getAnchorById(angr.createAnchor(name, offset_01, unit)),
                 tmp_str);

  // annotation for the phone "ax"
  //
  tmp_str.assign(L"ax");
  angr.createAnnotation(name,
                 angr.getAnchorById(angr.createAnchor(name, offset_01, unit)),
                 angr.getAnchorById(angr.createAnchor(name, offset_02, unit)),
                 tmp_str);

  // exit gracefully
  //
  Integral::exit();
}

The example above creates an annotation graph for the orthographic transcription "the" and the corresponding phonemic transcription, i.e., the phone /dh/ followed by /ax/. However, the example does not tag the different levels of the annotation graph. We must tag the different levels if we intend to extract the orthographic and phonemic transcriptions from the graph. The following example shows how to tag the different levels in the annotation graph:

#include <AnnotationGraph.h>

int main(int argc, const char** argv) {

  String new_id;
  String tmp_str;
  String key;
  String value;

  String name(L"EXAMPLE");
  String type(L"TRANSCRIPTION");
  String unit(L"SECONDS");

  Float offset_00(0.0);
  Float offset_01(0.1);
  Float offset_02(0.2);

  AnnotationGraph angr(name, type);

  // annotation for the word "the"
  //
  tmp_str.assign(L"the");
  new_id = angr.createAnnotation(name,
                  angr.getAnchorById(angr.createAnchor(name, offset_00, unit)),
                  angr.getAnchorById(angr.createAnchor(name, offset_02, unit)),
                  tmp_str);

  // tag annotation with the value "ORTHOGRAPHIC"
  //
  key.assign(L"level");
  value.assign(L"ORTHOGRAPHIC");
  angr.setFeature(new_id, key, value);

  // annotation for the phone "dh"
  //
  tmp_str.assign(L"dh");
  new_id = angr.createAnnotation(name,
                 angr.getAnchorById(angr.createAnchor(name, offset_00, unit)),
                 angr.getAnchorById(angr.createAnchor(name, offset_01, unit)),
                 tmp_str);

  // tag annotation with the value "PHONETIC"
  //
  value.assign(L"PHONETIC");
  angr.setFeature(new_id, key, value);

  // annotation for the phone "ax"
  //
  tmp_str.assign(L"ax");
  new_id = angr.createAnnotation(name,
                 angr.getAnchorById(angr.createAnchor(name, offset_01, unit)),
                 angr.getAnchorById(angr.createAnchor(name, offset_02, unit)),
                 tmp_str);

  // tag annotation with the value "PHONETIC"
  //
  value.assign(L"PHONETIC");
  angr.setFeature(new_id, key, value);

  // exit gracefully
  //
  Integral::exit();
}

Tagging each annotation allows us to extract the different levels of the annotation graph by referring to the tag. In the example above we can extract all annotations whose level is "ORTHOGRAPHIC" or "PHONETIC". In the first case the annotation with the notation "the" is returned, and in the second case the annotations with the notations /dh/ and /ax/ are returned.
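The tag-and-query idea can be sketched in standard C++: each annotation carries a key/value feature map, and a query filters on the "level" key. This is a hedged illustration of the mechanism, not the library's API; `Annotation` and `byLevel` are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// A tagged annotation: a notation plus arbitrary key/value features.
struct Annotation {
  std::string notation;
  std::map<std::string, std::string> features;
};

// Return every notation whose "level" feature matches the query.
std::vector<std::string> byLevel(const std::vector<Annotation>& anns,
                                 const std::string& level) {
  std::vector<std::string> out;
  for (const auto& a : anns) {
    auto it = a.features.find("level");
    if (it != a.features.end() && it->second == level)
      out.push_back(a.notation);
  }
  return out;
}

// The annotations built in the tagged example above.
static const std::vector<Annotation> tagged = {
  {"the", {{"level", "ORTHOGRAPHIC"}}},
  {"dh",  {{"level", "PHONETIC"}}},
  {"ax",  {{"level", "PHONETIC"}}}
};
```

Querying for "ORTHOGRAPHIC" yields the single word annotation, while querying for "PHONETIC" yields the two phones in order, matching the behavior described in the text.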

Resources

The following links point to resources that will be helpful in learning how to build and use our annotation graph library. These links contain documentation and APIs related to the annotation graph toolkit. More information on the Linguistic Data Consortium's Annotation Graph Toolkit can be found at http://agtk.sourceforge.net/.

References

  1. S. Bird and M. Liberman, "A Formal Framework for Linguistic Annotation," Linguistic Data Consortium, University of Pennsylvania, Philadelphia, Pennsylvania, USA, 2000.

  2. K. Maeda, X. Ma, H. Lee, and S. Bird, The Annotation Graph Toolkit: Application Developer's Manual (Draft), Linguistic Data Consortium, University of Pennsylvania, Philadelphia, Pennsylvania, USA, 2001 (see http://agtk.sourceforge.net/).