
Let's begin with a brief description of the XML structure that allows us to create linear speech sequences, the <item> tag, and the structure that allows us to create branches, the <one-of> tag. We'll then discuss more advanced uses of those same tags, including nested combinations of the tags, weights, and self loops. Once those basic building blocks have been discussed, we'll cover how to place speech sequences within XML rules and how to place rules within an XML grammar. Finally, we will see how ISIP's network conversion tool, isip_network_converter, can be used to convert XML format grammars to JSGF format or ISIP's DiGraph format.


LINEAR SEQUENCES


To create a linear speech sequence, simply place the sequence between a pair of item tags or rule tags. We'll cover other uses of both item and rule tags later in the tutorial.

    <item> a simple speech sequence </item>
    <rule id="example"> a simple speech sequence </rule>
These represent the following graph:

and would match the phrase "a simple speech sequence", with no possible variations. (Never mind the truncated symbol names on the graph; isip_network_builder displays only 5 letters of the symbol associated with a node.)

BRANCHES

To create a branch, place tokens between a <one-of> and a </one-of> tag.
    <one-of>
       simple
       typical
       complex
    </one-of>
This represents the graph you see to the right, and would match either "simple", "typical", or "complex".


COMBINING LINEAR SEQUENCES AND BRANCHES

The two structures may be combined as follows
    <item>
        a
        <one-of>
            simple typical complex
        </one-of>
        speech sequence
    </item>
to form this graph:

This grammar fragment would match "a simple speech sequence", "a typical speech sequence", or "a complex speech sequence".

If you wish to make a particular branch within the <one-of> structure contain a sequence of multiple tokens (or more nested sequence/branch structures), nest the branch in a pair of <item> </item> tags.

For example:
    <item>
        a
        <one-of>
            <item> very simple </item>
            typical
            <item> very complex </item>
        </one-of>
        speech sequence
    </item>
This grammar fragment represents

and would match "a very simple speech sequence", "a typical speech sequence", or "a very complex speech sequence".

These <item> and <one-of> branches may be nested as many times as needed to represent more complex graphs, but for the purposes of this tutorial, we will cover only fairly simple structures.
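
The way nested <item> and <one-of> structures combine can be sketched in a few lines of Python. This is only an illustration built on the standard xml.etree.ElementTree module, not ISIP's parser; the function name expand is an assumption made for this sketch, and it does not handle the weight or repeat attributes.

```python
import itertools
import xml.etree.ElementTree as ET

FRAGMENT = """
<item>
    a
    <one-of>
        <item> very simple </item>
        typical
        <item> very complex </item>
    </one-of>
    speech sequence
</item>
"""

def expand(elem):
    """Return every token sequence that the element can match."""
    # Gather the element's pieces in document order. Each piece is a list
    # of alternative token sequences; a bare token has a single alternative.
    pieces = [[[tok]] for tok in (elem.text or "").split()]
    for child in elem:
        pieces.append(expand(child))
        pieces.extend([[tok]] for tok in (child.tail or "").split())
    if elem.tag == "one-of":
        # Branch: exactly one of the pieces is taken.
        return [seq for piece in pieces for seq in piece]
    # Linear sequence: one alternative from every piece, in order.
    return [sum(combo, []) for combo in itertools.product(*pieces)]

phrases = [" ".join(seq) for seq in expand(ET.fromstring(FRAGMENT))]
print(phrases)
```

Running the sketch on the fragment above lists the same three phrases given in the text.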

USING WEIGHTS

Weights may be placed wherever branches exist. To place a weight, add the attribute "weight" to a start <item> tag and give it a floating-point value. Only <item> tags nested directly within <one-of> tags may have weights. Weights encountered elsewhere will be ignored.
    <item>
       <item weight="0.33"> a </item> <!-- note: this weight is illegal and will be ignored -->
       <one-of>
           <item weight="0.2">very simple </item>
           <item weight="0.3">typical </item>
           <item weight="0.5"> very complex </item>
       </one-of>
       speech sequence
    </item>
These weights represent the probability that a particular branch will be taken.
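
The rule about where weights are honored can be checked mechanically. The following Python sketch, which assumes nothing beyond the rule stated above, separates weights on <item> tags nested directly within <one-of> from weights found anywhere else. The function name split_weights is hypothetical, and the sketch says nothing about how ISIP treats branches that carry no weight.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """
<item>
    <item weight="0.33"> a </item>
    <one-of>
        <item weight="0.2"> very simple </item>
        <item weight="0.3"> typical </item>
        <item weight="0.5"> very complex </item>
    </one-of>
    speech sequence
</item>
"""

def split_weights(root):
    """Separate weights that sit directly under <one-of> from all others."""
    # Only <item> elements that are direct children of a <one-of> qualify.
    branch_items = {item for one_of in root.iter("one-of")
                    for item in one_of.findall("item")}
    honored, ignored = [], []
    for item in root.iter("item"):
        if "weight" not in item.attrib:
            continue
        text = " ".join("".join(item.itertext()).split())
        entry = (text, float(item.get("weight")))
        (honored if item in branch_items else ignored).append(entry)
    return honored, ignored

honored, ignored = split_weights(ET.fromstring(FRAGMENT))
print("honored:", honored)
print("ignored:", ignored)
```

In this fragment the three branch weights sum to 1, which is consistent with reading them as probabilities.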


USING THE REPEAT ATTRIBUTE

Any item tag may contain the attribute "repeat", which allows a single token or node to occur multiple times. The repeat attribute may legally have a variety of values allowing the node to occur optionally, a specified finite number of times, and/or an unbounded number of times; however, our software does not support a finite value for the repeat attribute. Any value given for the repeat attribute is treated as if it were "1-", which means the node must occur at least once and may repeat any number of times after that. See below for an example of the use of the repeat attribute.
    <item>
       <item weight="0.33"> a </item> <!-- note: this weight is illegal and will be ignored -->
       <one-of>
           <item weight="0.2">very simple </item>
           <item weight="0.3">typical </item>
           <item weight="0.5">
              <item repeat="1-"> very </item>
               complex
           </item>
       </one-of>
       speech sequence
    </item>

This sequence would allow "a very simple speech sequence", "a typical speech sequence", "a very complex speech sequence", or "a very very very complex speech sequence".
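
One way to picture the effect of repeat="1-" is to write the set of phrases accepted by the fragment above as a regular expression. The pattern below is hand-written for this one example and is only an analogy; it is not produced by any ISIP tool, and the helper name accepts is an assumption of the sketch.

```python
import re

# Hand-written regular expression for the fragment above; "(very )+" plays
# the role of <item repeat="1-"> very </item> (one or more occurrences).
PATTERN = re.compile(r"a (very simple|typical|(very )+complex) speech sequence")

def accepts(phrase):
    """Return True if the whole phrase is matched by the pattern."""
    return PATTERN.fullmatch(phrase) is not None

for phrase in ("a very very very complex speech sequence",
               "a typical speech sequence",
               "a complex speech sequence"):
    print(phrase, "->", accepts(phrase))
```

The last phrase is rejected because "very" must occur at least once before "complex".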

Weights may be attached to repeat loops by adding the "repeat-prob" attribute to the <item> tag which contains the "repeat" attribute.
    <item>
       <item weight="0.33"> a </item> <!-- note: this weight is illegal and will be ignored -->
       <one-of>
           <item weight="0.2">very simple </item>
           <item weight="0.3">typical </item>
           <item weight="0.5">
              <item repeat="1-" repeat-prob="0.5"> very </item>
               complex
           </item>
       </one-of>
       speech sequence
    </item>
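
To get a feel for what a repeat-prob value implies, the sketch below assumes that the self loop is taken with probability repeat-prob after each occurrence and left with probability 1 - repeat-prob. That interpretation is an assumption made for illustration and may not match exactly how ISIP's software scores the loop; the function name is hypothetical as well.

```python
def repetition_probability(count, repeat_prob):
    """Probability of exactly `count` occurrences under the stated assumption."""
    if count < 1:
        return 0.0  # repeat="1-" requires at least one occurrence
    # The first occurrence is mandatory; each further occurrence takes the
    # loop once, and leaving the loop has probability (1 - repeat_prob).
    return repeat_prob ** (count - 1) * (1.0 - repeat_prob)

for count in range(1, 5):
    print(count, repetition_probability(count, 0.5))
```

Under this assumption, with repeat-prob="0.5" each additional "very" halves the probability of the path.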


CREATING RULES

Speech sequences must be contained within a rule. To define a rule, simply place the speech sequence between a <rule id="rulename"> tag, and a </rule> tag. The <rule> tag must have an attribute called "id", whose value is set to the name of the rule. Placing the speech sequence from above within a rule, we have
    <rule id="example">
        <item>
            a
            <one-of>
                <item> very simple </item>
                <item> typical </item>
                <item> very complex </item>
            </one-of>
            speech sequence
        </item>
    </rule>
Rules may refer to other rules by means of a <ruleref uri="#rulename"/> tag. For instance, if we define a rule
    <rule id="complexity">
        <one-of>
            <item> very simple </item>
            <item> typical </item>
            <item> very complex </item>
        </one-of>
    </rule>
We could then define the "example" rule as
    <rule id="example">
        <item>
            a
            <ruleref uri="#complexity"/>
            speech sequence
        </item>
    </rule>
and achieve the same result. The # before the referenced rule name means that the rule is defined locally (within the same grammar). Non-local rule references are generally allowed, and may be specified by putting a URL as the value of the uri attribute, but our system does not currently support them.
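
The claim that the two-rule version matches the same phrases can be checked with a short Python sketch that follows local rule references while expanding a rule. It is an illustration only: the <rules> wrapper is a hypothetical container used so that both rules parse as one document, the function name expand is an assumption, and recursive references and the repeat attribute are not handled.

```python
import itertools
import xml.etree.ElementTree as ET

# The two rules from the text inside a placeholder <rules> element
# (hypothetical, present only so the rules form a single XML document).
RULES = """
<rules>
    <rule id="complexity">
        <one-of>
            <item> very simple </item>
            <item> typical </item>
            <item> very complex </item>
        </one-of>
    </rule>
    <rule id="example">
        <item>
            a
            <ruleref uri="#complexity"/>
            speech sequence
        </item>
    </rule>
</rules>
"""

def expand(elem, rules):
    """Return every token sequence `elem` can match, following local rulerefs."""
    if elem.tag == "ruleref":
        uri = elem.get("uri", "")
        if not uri.startswith("#"):
            raise ValueError("only local references are handled: " + uri)
        # "#name" refers to the rule whose id is "name".
        return expand(rules[uri[1:]], rules)
    pieces = [[[tok]] for tok in (elem.text or "").split()]
    for child in elem:
        pieces.append(expand(child, rules))
        pieces.extend([[tok]] for tok in (child.tail or "").split())
    if elem.tag == "one-of":
        return [seq for piece in pieces for seq in piece]
    return [sum(combo, []) for combo in itertools.product(*pieces)]

root = ET.fromstring(RULES)
rules = {rule.get("id"): rule for rule in root.iter("rule")}
phrases = [" ".join(seq) for seq in expand(rules["example"], rules)]
print(phrases)
```

The printed phrases are the same three that the single-rule version matches.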

CREATING A GRAMMAR

Just as speech must be contained in rules, rules must be contained in a grammar. To place a rule within a grammar, simply place the entire rule between <grammar root="example"> and </grammar> tags. The attribute "root" tells the speech processor where to start processing: here the rule "example" will be read and processed first, and any rules referenced by the root rule directly or indirectly will be processed as well.

One additional requirement is the presence of the XML header tag at the top of any XML document, including grammars. An example header would be <?xml version="1.0"?>. The header may also include many other attributes that provide information about the grammar, but none of them are currently supported by our software. For more information regarding those attributes, see the W3C Speech Recognition Grammar Specification Version 1.0.

Placing our previously defined rules within grammar tags and adding the XML header tag produces a complete grammar.
    <?xml version="1.0"?>
    <grammar root="example">

    <rule id="example">
        <item>
            a
            <ruleref uri="#complexity"/>
            speech sequence
        </item>
    </rule>

    <rule id="complexity">
        <one-of>
            <item> very simple </item>
            <item> typical </item>
            <item> very complex </item>
        </one-of>
    </rule>

    </grammar>
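
The role of the root attribute can be illustrated with a small Python sketch that starts at the root rule and collects every rule it references, directly or indirectly. The traversal order is one plausible choice and is not taken from ISIP's implementation; the function name reachable_rules and the extra rule "unused" are hypothetical additions for the example.

```python
import xml.etree.ElementTree as ET

GRAMMAR = """<?xml version="1.0"?>
<grammar root="example">
    <rule id="example">
        <item> a <ruleref uri="#complexity"/> speech sequence </item>
    </rule>
    <rule id="complexity">
        <one-of>
            <item> very simple </item>
            <item> typical </item>
            <item> very complex </item>
        </one-of>
    </rule>
    <rule id="unused"> never reached </rule>
</grammar>
"""

def reachable_rules(grammar):
    """Return the ids of rules reachable from the root rule, root first."""
    rules = {rule.get("id"): rule for rule in grammar.iter("rule")}
    order, pending = [], [grammar.get("root")]
    while pending:
        name = pending.pop()
        if name in order:
            continue  # already visited; also guards against reference cycles
        order.append(name)
        # Queue every locally referenced rule ("#name") inside this rule.
        pending.extend(ref.get("uri")[1:] for ref in rules[name].iter("ruleref"))
    return order

order = reachable_rules(ET.fromstring(GRAMMAR))
print(order)
```

The rule "unused" does not appear in the result because nothing reachable from the root refers to it.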


CREATING AN SOF FILE CONTAINING XML GRAMMARS

In order to use an XML format grammar with our software, the grammar must be contained in a Signal Object File (SOF). Once in SOF format, the grammars may be converted using isip_network_converter to produce grammars in JSGF format or the ISIP DiGraph format, either of which may be used as input for a variety of ISIP tools.

An SOF file requires that the header @ Sof v1.0 @ be present at the top of the file, and that objects written in the file be delimited by object tags, which look like @ XML 0 @, where XML is the type of object and 0 is an index which distinguishes multiple objects of the same type from each other.

There is another requirement imposed specifically upon grammar objects, but before I mention it, let me give you a little background concerning the organization of grammars within ISIP software. ISIP software organizes grammars into multiple levels, so that a user may define a high-level grammar, and then break down the individual elements of that grammar into smaller elements. An example would be a word-level grammar, where each node in the graph represents a single spoken word. Beneath the word level, a phone level could be defined, which would contain a sub-grammar for each word in the word level. Each of these sub-grammars would contain a definition of a word broken down into its basic phones. The requirement based on this organization is that the name of the level has to be specified. The level name may be arbitrary; in the example discussed above, the top level's name would probably be "word", and the sub-level's name would be "phone".

Observe the following example:
    @ Sof v1.0 @
    @ XML 0 @
    <?xml version="1.0" encoding="utf-8"?>
    <grammar root="sentence">
    <rule id="sentence">
       seven four six
    </rule>
    </grammar>

    @ search_tag 0 @
    <?xml version="1.0" encoding="utf-8"?>
    <grammar root="level0">
    <rule id="level0">
        word
    </rule>
    </grammar>

    @ XML 1 @
    <?xml version="1.0" encoding="utf-8"?>
    <grammar root="seven">
    <rule id="seven">
       s eh v ih n
    </rule>
    </grammar>

    <?xml version="1.0" encoding="utf-8"?>
    <grammar root="four">
    <rule id="four">
       f ow r
    </rule>
    </grammar>

    <?xml version="1.0" encoding="utf-8"?>
    <grammar root="six">
    <rule id="six">
       s ih k s
    </rule>
    </grammar>

    @ search_tag 1 @
    <?xml version="1.0" encoding="utf-8"?>
    <grammar root="level1">
    <rule id="level1">
        phone
    </rule>
    </grammar>
This SOF file contains a "two-level grammar" that is comprised of a word level and a phone level. The top level contains one grammar which represents the phrase "seven four six", and the second level contains three grammars, each with the phonetic breakdown of one word from the level above. The additional requirement I mentioned above to which grammars must conform is the definition of the name of the grammar level. The @ search_tag X @ is an SOF header identifying the location of the name of the level which is indexed by the number X. The name of the level is itself defined as the sole element within an XML grammar beneath that header.
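
The layout described above can be made concrete with a short Python sketch that splits SOF text into its tagged objects and reads the level names. It relies only on the conventions stated in this tutorial (the @ Sof v1.0 @ header and @ TYPE INDEX @ object tags) and is not ISIP's SOF reader; the function names split_sof and level_name are assumptions, and the sample is a shortened version of the example file.

```python
import re
import xml.etree.ElementTree as ET

SAMPLE = """@ Sof v1.0 @
@ XML 0 @
<?xml version="1.0" encoding="utf-8"?>
<grammar root="sentence">
<rule id="sentence"> seven four </rule>
</grammar>

@ search_tag 0 @
<?xml version="1.0" encoding="utf-8"?>
<grammar root="level0">
<rule id="level0"> word </rule>
</grammar>

@ XML 1 @
<?xml version="1.0" encoding="utf-8"?>
<grammar root="seven">
<rule id="seven"> s eh v ih n </rule>
</grammar>

<?xml version="1.0" encoding="utf-8"?>
<grammar root="four">
<rule id="four"> f ow r </rule>
</grammar>

@ search_tag 1 @
<?xml version="1.0" encoding="utf-8"?>
<grammar root="level1">
<rule id="level1"> phone </rule>
</grammar>
"""

OBJECT_TAG = re.compile(r"\s*@\s*(\w+)\s+(\d+)\s*@\s*")

def split_sof(text):
    """Map (object type, index) to the XML documents stored under that tag."""
    lines = text.strip().splitlines()
    if lines[0].split() != ["@", "Sof", "v1.0", "@"]:
        raise ValueError("missing '@ Sof v1.0 @' header")
    bodies, key = {}, None
    for line in lines[1:]:
        tag = OBJECT_TAG.fullmatch(line)
        if tag:
            key = (tag.group(1), int(tag.group(2)))
            bodies[key] = []
        elif key is not None:
            bodies[key].append(line)
    # One object may hold several grammars; each starts with "<?xml".
    return {key: ["<?xml" + doc
                  for doc in "\n".join(body).split("<?xml") if doc.strip()]
            for key, body in bodies.items()}

def level_name(objects, index):
    """Read the level name from the grammar under '@ search_tag index @'."""
    grammar = ET.fromstring(objects[("search_tag", index)][0])
    return "".join(grammar.find("rule").itertext()).split()[0]

objects = split_sof(SAMPLE)
print(sorted(objects))
print(level_name(objects, 0), level_name(objects, 1))
```

In the sample, the object tagged @ XML 1 @ holds two grammars, one per word, and the two search_tag objects name the levels "word" and "phone".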

Dummy symbols within the grammar may be specified in a similar fashion. Using the header @ search_dummy_symbols X @, a grammar may be included whose elements will all be treated as dummy symbols (any occurrence of a dummy symbol will be ignored by a speech processor). For example, if one wished to ignore all occurrences of the words "four" and "six" in the two-level grammar defined above, one would only have to include
    @ search_dummy_symbols 0 @
    <?xml version="1.0" encoding="utf-8"?>
    <grammar root="sentence">
    <rule id="sentence">
       four six
    </rule>
    </grammar>

at the end of the level definition. Note that the index 0 is the index of the level in which the dummy symbols exist.

CONVERTING AN XML GRAMMAR TO DIGRAPH OR JSGF FORMAT

Following our next release, ISIP's general purpose network conversion tool, isip_network_converter, will support conversion of an XML grammar to DiGraph or JSGF format. If you do not currently have ISIP's software installed, you may find detailed instructions on how to download and install our software on your system here. Once you have an ISIP repository set up, you can use isip_network_converter as follows:

isip_network_converter -input_format XML -output_format DIGRAPH -output_type TEXT input_grammar.sof output_grammar.sof output_stat_model_pool.sof

Allowed input formats are XML, DIGRAPH, and JSGF. Allowed output formats are DIGRAPH and JSGF. Allowed output types are TEXT and BINARY.

The file input_grammar.sof is an XML grammar in an SOF file, the file output_grammar.sof is the file which will be generated by isip_network_converter in the format of your choice, and the file output_stat_model_pool.sof is a statistical model pool associated with your grammar that is generated by the network converter. The statistical model pool is used by our speech recognizer, but is beyond the scope of this tutorial.
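
When the converter is called from a script, the allowed option values listed above can be checked before the command is assembled. The Python sketch below only builds the argument list from the options and file order shown in this tutorial; it does not run the tool, and the function name converter_command is hypothetical.

```python
INPUT_FORMATS = {"XML", "DIGRAPH", "JSGF"}
OUTPUT_FORMATS = {"DIGRAPH", "JSGF"}
OUTPUT_TYPES = {"TEXT", "BINARY"}

def converter_command(input_format, output_format, output_type,
                      input_file, output_file, model_pool_file):
    """Build the isip_network_converter argument list after checking options."""
    if input_format not in INPUT_FORMATS:
        raise ValueError("unsupported input format: " + input_format)
    if output_format not in OUTPUT_FORMATS:
        raise ValueError("unsupported output format: " + output_format)
    if output_type not in OUTPUT_TYPES:
        raise ValueError("unsupported output type: " + output_type)
    # Same option names and file order as the command line shown above.
    return ["isip_network_converter",
            "-input_format", input_format,
            "-output_format", output_format,
            "-output_type", output_type,
            input_file, output_file, model_pool_file]

command = converter_command("XML", "DIGRAPH", "TEXT",
                            "input_grammar.sof", "output_grammar.sof",
                            "output_stat_model_pool.sof")
print(" ".join(command))
```

The resulting list could be handed to a process launcher such as subprocess.run once the ISIP tools are installed and on the search path.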

If you wish to view or manipulate your results in graphical form, the DiGraph or JSGF output file may be opened with isip_network_builder.