
Text parsing is one of the more essential software components in speech recognition research, since we ultimately process human language in both audio and text formats. Languages that allow easy manipulation and parsing of text, such as Perl, have become very popular in speech research. In fact, Perl was created by a trained linguist to provide an easy-to-use, flexible and extensible language for text processing. Perl supports powerful regular expressions, including a pattern-matching operator to find a pattern in a string, a substitution operator to substitute one string for another, and a split operator to parse a string based on a delimiter. Text processing can be done quickly and efficiently in Perl, but such a language is not necessarily optimal for computationally intensive research tasks such as speech recognition.

The ISIP Foundation Classes (IFCs) provide extensive support for string processing so that such code can be easily integrated at a programming level with other speech recognition software. Most text parsing and text processing functionality is implemented within the SysString class, which belongs to the system library level of the IFC class hierarchy. Users typically access this functionality through the String class, which inherits from SysString. Several important features of this interface are described below.
  • string tokenize methods: parse a string into smaller substrings based on a user-defined delimiter;

  • count token methods: count the number of tokens in the given string;

  • replace/insert methods: substring manipulations;

  • string search methods: search a given string for the position of the first or the last occurrence of a character or a substring;

  • string/numeric concatenation methods: concatenate strings;

  • trim methods: remove certain characters or substrings from the input string.
Let us consider a few simple examples to contrast string processing in Perl and the IFCs. These examples cover some of the functionality described in the previous paragraph. Let us first consider tokenization, one of the more important functions in language processing; it is the equivalent of split in Perl. The Perl code to parse the words in the sentence "Jack and Jill went up to hill" is given below:

    #! /usr/local/bin/perl
    # file: ./examples/example_01.pl
    #
    
    # sentence to be parsed
    #
    $sentence = "Jack    and Jill went up to hill";
    
    # split the words using multiple spaces as a delimiter
    #
    @words = split(/\s+/, $sentence);
    
    # get the count of the words
    #
    $count = scalar(@words);
    
    # print the parsed words on the console, one per line
    #
    for ($i = 0; $i < $count; $i++) {
        print "word = $words[$i]\n";
    }

Comparable code in the IFCs uses the tokenize function:

    // file: ./examples/example_01.cc
    //
    // isip include files
    //
    #include <String.h>
    #include <Vector.h>
    
    // main program starts here
    //
    int main(int argc, const char **argv) {
      
      // declare the sentence as a String object
      //
      String sentence(L"Jack     and Jill went up to hill");
      
      // get the counts of the words
      //
      long count = sentence.countTokens(L" ");
    
      // declare the vector of words
      //
      Vector<String> words(count);
      
      // position in the string where the next token starts;
      // updated by each call to tokenize
      //
      long pos = 0;
    
      // get each word by tokenizing using multiple spaces as a delimiter
      //
      for (long i = 0; i < count; i++) {
        sentence.tokenize(words(i), pos, L" ");
      }
    
      // print the words on the console one at a time using the debug
      // method
      //
      for (long i = 0; i < count; i++) {
        words(i).debug(L"word");
      }
      
      // exit gracefully
      //
      Integral::exit();
    }

The next example demonstrates pattern matching and substitution. Consider an example in which we want to replace all occurrences of the string five with the string ten. The following code, written in Perl, uses the global substitution operator:
    #! /usr/local/bin/perl
    # file: ./examples/example_02.pl
    #
    
    # sentence to be modified
    #
    $sentence = "six one five four nine five three";
    
    # replace the "five" by "ten" at all occurrences using global
    # substitution
    #
    $five = "five";
    $ten = "ten";
    $sentence =~ s/$five/$ten/g;
    
    # print the modified sentence to the console
    #
    print "modified sentence: $sentence\n";

The same functionality implemented with the IFCs uses the replaceAll method. Here we replace all occurrences of the string five with the string ten:
    // file: ./examples/example_02.cc
    //
    // isip include files
    //
    #include <String.h>
    
    // main program starts here
    //
    int main(int argc, const char **argv) {
      
      // declare the sentence as a String object
      //
      String sentence(L"six one five four nine five three");
      
      // replace the "five" by "ten" at all occurrences
      //
      sentence.replaceAll(L"five", L"ten");
    
      // print the modified sentence to the console
      //
      sentence.debug(L"modified sentence");
      
      // exit gracefully
      //
      Integral::exit();
    }

One of the most frequent uses of text parsing is loading parameter data from files to configure programs. Several parsers are available within the IFC environment for this purpose, so users rarely need to write custom code for this task. We further reduce the need for intensive text processing by avoiding unformatted data in our environment: most data is stored in a Signal Object File (Sof) representation that makes it easy to read and write structured data to and from files.