General comments to all reviewers:

We appreciate the excellent feedback. Please note that the paper, as it
was originally submitted, was already at the transactions page limit.
Hence, we have tried to thoroughly address all the comments below
without increasing the overall length of the paper. This work is
described in detail in Ganapathiraju's dissertation, which is referenced
heavily in the paper. The dissertation is available on our web site to
facilitate access:

http://www.isip.msstate.edu/publications/books/msstate_theses/2002/support_vectors/thesis/thesis_final.pdf

Reviewer No. 1: DETAILED COMMENTS
---------------------------------

> 1) Good tutorial on the evolution of pattern recognition
> techniques as applied to speech recognition. Good description
> of the SVM and reasons for its use. The latter may not be well
> known in the signal processing community.

This is one reason we are interested in publishing this work in this
special issue. In fact, since we started this work in 1997 and
demonstrated the first LVCSR applications in 1998, SVMs have found their
way into many speech applications. Most recently, results on speaker
verification have been demonstrated by several sites. We are currently
working on similar applications.

> 2) Inadequate explanation and justification for eq. (13). This
> only gives local information about the observable pdfs near the
> decision boundaries.

We do not have enough space to show the experimental fits to the data.
We have added more explanation in the first paragraph of the section
titled "Posterior Estimation", including references to Tipping's and
Kwok's work, which we followed, and which address this issue in greater
detail. Also, Platt's work, which is referenced in the last paragraph of
this section, addresses issues in using cross-validation methods to
estimate these parameters.

> 3) Inadequate explanation of the overall recognition system. I
> assume that it's a standard IBM style model but that must be
> made clear and explicit.

In the first sentence of section 2, we reference statistical methods as
described by Jelinek, and explicitly state that these are what we use
throughout this work. The first two paragraphs of this section briefly
review the standard statistical approach to speech recognition.

> 4) Inadequate evaluation of the results. How can we conclude
> that there is a performance improvement. Table 1 is not
> enough.

Table 1 demonstrates the efficacy of the basic classifier on speech-like
data. This was a significant historical step forward, which motivated us
to do more work. Table 2 also describes improvements. We have added
additional discussion and reworked the explanation of Table 2.
Hopefully, the significance of Table 2 is now clearer. Note that we are
not claiming dramatic improvements in performance; modest gains are
quite natural for a first attempt at improving a speech recognition
system. Row 3 of Table 2 is particularly important, since it suggests
that there could be significant improvements in performance IF we solve
the supervised training problem. This will be the topic of a future
paper.

> 5) Inadequate discussion of the pitfalls of the SVM when
> projecting into very high dimensional feature spaces.

This is a good point. Due to length constraints, we had to drop many of
these issues. We have published separate work on this in:
A. Ganapathiraju and J. Picone, "Support Vector Machines For Automatic
Data Cleanup," Proceedings of the International Conference on Spoken
Language Processing, vol. 4, pp. 210-213, Beijing, China, October 2000.
Overcoming some of the practical problems associated with training SVMs
was a major contribution. Again, the details are described in
Ganapathiraju's dissertation and referenced in this paper.
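To make our response to comment 2 above more concrete, the short sketch
below shows the kind of sigmoid fit, in the spirit of Platt's approach,
that underlies the revised "Posterior Estimation" section: the
parameters A and B of a sigmoid mapping from SVM distances to posterior
estimates are obtained by maximizing the likelihood of held-out data.
The function names, optimizer, and starting values are illustrative
choices for this example, not the exact implementation in our system.

    import numpy as np
    from scipy.optimize import minimize

    def fit_sigmoid(distances, labels):
        """Fit p(in-class | f) = 1 / (1 + exp(A*f + B)) to SVM outputs f
        by maximizing the log-likelihood on a held-out set."""
        targets = (labels + 1) / 2.0         # map {-1,+1} labels to {0,1}

        def neg_log_likelihood(params):
            A, B = params
            p = 1.0 / (1.0 + np.exp(A * distances + B))
            eps = 1e-12                      # guard against log(0)
            return -np.sum(targets * np.log(p + eps) +
                           (1.0 - targets) * np.log(1.0 - p + eps))

        result = minimize(neg_log_likelihood, x0=[-1.0, 0.0],
                          method="Nelder-Mead")
        return result.x                      # the fitted (A, B)

    # Usage (illustrative): distances are signed SVM outputs on held-out
    # data, labels are the corresponding +/-1 class labels.
    # A, B = fit_sigmoid(distances, labels)
    # posterior = 1.0 / (1.0 + np.exp(A * f_new + B))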
Reviewer No. 2: DETAILED COMMENTS
---------------------------------

> p. 1. "Simplistic techniques...": this derogates early work
> in recognition and is an example of what I call a cavalier or
> immodest writing style. I suggest "simple techniques" instead.

Corrected.

> "relatively fragile" --> relative to what? Drop "relative", and
> be more specific in describing system fragility.

We have added more explanation and a reference.

> "we demonstrate this" --> instead "that" between last 2 words.

Corrected.

> p. 2-4. A bit too much review of HMM/EM fundamentals. Condense.

We have shortened this section by one page. We could drop the discussion
of ANNs, but several people have argued about this point when we have
presented this work. Therefore, we think it is important to cite the ANN
material in this way.

> p. 3.
>
> "based as an optimization" --> "based ON an optimization"?
>
> "in this example any amount of effort ... will not" --> "in this
> example NO amount of effort".

Fixed.

> p. 4. "The primary different between ML-based HMM... the wrong
> model is used". This is a good point, but the phrasing is confusing.
> Perhaps something like "the objective criterion for the latter
> reflects classification performance even if the wrong probabilistic
> model is used"...?

This paragraph was deleted in an effort to reduce the size of this
section.

> p. 5-8: This overview section seems slightly long.
>
> p. 5. "Closed-loop optimality". I assume this refers to performance
> on the training set. I don't believe that "closed-loop" is a common
> term in machine learning or speech recognition.
>
> "the design of a classifier is essentially a process" --> "the design
> of a classifier is essentially THE process"
>
> p. 6, first paragraph introduces the term "loss function", but then
> this is re-introduced and italicized further down the page.
> Italicize/emphasize the term at its first use.
>
> p. 8-10. This section summarizes SVMs, but this is all review
> material. Suggest condensing.

Pages 5-10 have been condensed to only the essentials. We are concerned
that this reduces the tutorial value of the paper.

> p. 11. On page 11 of a paper whose conclusion starts on page 16,
> finally we're starting to hear details of what the authors did
> in this study. I suggest adding one or two lines describing
> in more detail the ML-approach to estimating A and B of
> the sigmoid.
>
> "expense of increased computations" --> "expense of increased
> computation time"

Extensive modifications were made to this description, including a
figure showing the results. These changes should make it much easier for
the reader to understand the details of this work. We have also included
references to the definitive sources for this approach.

> p. 12. "the phone 's'" --> "the phone /s/". However, the point
> that the authors are trying to make is independent of the
> phone identity, so I suggest not introducing the phone identity.

Good point - fixed.

> p. 13. What do the authors mean exactly by "we use segments composed
> of the three sections"? I assume three equal-dimension vectors
> are concatenated. What is the dimensionality of these vectors?
> How are they themselves derived? The following text mentions
> LPC (log-area parameters), but it's not clear if this is how the
> segments were calculated.

We added a paragraph describing this in more detail, and an associated
figure.
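As an aside that may help the reviewer visualize the new paragraph and
figure, the sketch below shows the general form of the segment-level
vector construction: the frames of a variable-length segment are divided
into three regions, the frame-level features in each region are
averaged, and the three region means are concatenated into one
fixed-length vector. The equal-length regions and the 13-dimensional
frame features in the usage note are illustrative assumptions, not
necessarily the exact configuration in our system.

    import numpy as np

    def segment_vector(frames):
        """Map a variable-length segment (num_frames x num_features) of
        frame-level features to a fixed-length vector: split the segment
        into three regions, average the frames in each region, and
        concatenate the three region means."""
        regions = np.array_split(frames, 3)   # three roughly equal regions
        return np.concatenate([r.mean(axis=0) for r in regions])

    # Usage (illustrative): for 13-dimensional frame features, the result
    # is a 39-dimensional vector regardless of the number of frames.
    # x = segment_vector(np.random.randn(27, 13))   # x.shape == (39,)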
> p. 14. "This was a good database" --> "This is a good database"

Corrected.

> "We have observed similar trends on a number of static
> classification tasks": details (or references) please.

Let us first explain the source of this comment, and then how we
addressed it in the paper. We have used a number of well-known pattern
recognition data sets in our pattern recognition course, and found that
SVMs work very well on these. In fact, we developed a Java applet that
demonstrates some of this. Obviously, such material cannot be published,
so we do not have conference or journal publications to point to. Since
this paper was written, several results have recently been published on
different speech tasks. We think it is important to show this to
readers, so they know our results are not anomalous. We have added a URL
reference to the pattern recognition applet, which should be useful to
people who want to explore static classification tasks. We have also
added a reference to a paper published at Eurospeech 2003.

> p. 16. "When we allow the SVM to decide the best segmentation and
> hypothesis combination by using the N-best segmentations..."
> This point regarding the segmentations, either decided by the
> baseline HMM system or by the SVMs, is very unclear. A lot more
> detail is needed on the authors' proposed technique. The authors have
> provided several figures describing the fundamentals of SVMs, but no
> figures describing their own recognition system architecture.
> Is their method essentially an N-best rescoring method? Will they
> always require a baseline HMM to provide N-best lists, or
> could the SVMs be used to supplant the HMM?

We added a figure at the end of section 5 that summarizes the overall
system architecture. We have also modified the discussion in section 6
to better explain the difference between using reference segmentations
and hypothesized segmentations. We have also tried to be very clear that
in this initial work we are using a hybrid approach. This is sufficient
for proof of concept, but obviously not acceptable for mature
technology. Subsequent research will be published on more integrated
approaches. This is mentioned briefly in the summary.

> "Table 1" -- this should be Table 2, I think.

Corrected.

> I strongly urge the authors to give slightly less details
> about review material that is already published, and many
> more details on what they themselves did in this study.

We appreciate this reviewer's detailed comments and have tried to
accommodate them. Our only concern is that this review material is still
not well known throughout the signal processing community, particularly
the speech community, and needs some forum for publication. Ironically,
the SP Magazine rejected our request to publish an extensive review
paper on risk minimization, and recommended we pursue publication in the
Signal Processing transactions.

Reviewer No. 3: DETAILED COMMENTS
---------------------------------

> This paper proposes two deep, theoretically interesting research
> topics, then drops both of them without warning. In the end,
> the experiments performed by the authors seem to be nothing more
> than standard SVM classification experiments, performed almost
> step by step according to the instructions in Vapnik's textbook.
We hope this reviewer will appreciate that the application of these
techniques to speech recognition is not trivial. We have been upfront
about the scope of this paper - it presents the first application of
SVMs to LVCSR tasks. We are not claiming innovations with respect to the
core SVM theory.

> Section 2 discusses the well-known problems with ML, and
> suggests MCE as a solution. Section 3 discusses the principle
> of structural risk minimization. Combination of these concepts
> would be very interesting, as MCE is itself a type of ERM;
> unfortunately, both of these concepts are dropped without
> warning, and the paper never returns to them.

We are currently working on a comparison of MMIE, MCE, and MLLR with
approaches such as SVMs and RVMs. Implementation of effective MMIE and
MCE approaches in a framework that is compatible with our SVM and RVM
systems is not trivial. We hope this will be the subject of a future
journal paper. However, we believe history supports our contention that
papers must be published on the efficacy of the basic SVM approach on
speech problems before it can be compared to mature competing
discriminative training technologies. We would like to publish at least
one journal paper that clearly describes the SVM approach in a speech
recognition application, since there are many choices one can make for
each component. In fact, our system, when it was first publicly
disclosed, was fairly unique in its combination of components.

> Section 4 presents the support vector machine, and section 5
> proposes an SVM/HMM hybrid. An SVM/HMM hybrid would be very
> interesting. ANN/HMM hybrids use the classification power of an
> ANN in order to improve the observation PDF computation of an
> HMM. As a result, the hybrid systems often outperform mixture
> Gaussian HMMs, at the cost of substantially increased
> computational complexity during training. An SVM/HMM hybrid,
> using SVM to compute the observation PDFs of an HMM in some way,
> would presumably do even better than the ANN/HMM hybrid,
> presumably at even greater computational cost. Unfortunately,
> this idea is dropped immediately after being suggested, because
> it is too computationally complex.

Our system is a hybrid SVM/HMM approach. We made some changes, as
described above, to make this clearer. We do not have an equivalent
ANN/HMM hybrid to compare to, since we do not have ANN technology
in-house. Building a state-of-the-art ANN system is a large task in
itself. We hope someone else who is expert in this technology can
publish a comparison to our system on comparable tasks.

> In the end, the authors perform n-ary classification of
> fixed-length segmental feature vectors. As far as I can tell,
> the only interaction between the HMM system and the SVM system
> is that the HMM is used to find segment boundaries for the SVM
> to classify; the final log-probabilities of the HMM and the SVM
> are then apparently added together in order to create the final
> recognition score. In my view, adding together the scores of
> two classifiers, both of which are trained and tested according
> to standard published references, does not constitute a "hybrid"
> system of sufficient novelty to merit publication.

We respectfully disagree, though we support the reviewer's efforts to
maintain a high threshold for publication. One could argue that almost
any innovation in speech recognition, such as the use of HMMs, fits the
above description for lack of novelty. In creating an SVM system, there
were many issues that had to be explored.
Ganapathiraju's dissertation goes into much more detail on many things
that were tried and DID NOT work. For example, we never could get Fisher
kernels to work well. We think it is important that people learn how to
combine various component technologies into a complete system. There was
also some work on score combination between SVMs and HMMs that we did
not discuss in this paper, but which is covered in the dissertation. You
might find that interesting.
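Finally, to make the hybrid rescoring discussed in the responses above
(and in the response to Reviewer No. 2's p. 16 comment) a bit more
concrete, the sketch below shows one way N-best hypotheses from a
baseline HMM can be rescored by combining the HMM log-likelihood with
segment-level SVM scores and re-ranking. The data structures, the
interpolation weight, and the svm_scorer function are illustrative
assumptions for this example, not the exact implementation in our
system.

    import numpy as np

    def rescore_nbest(nbest, svm_scorer, weight=0.5):
        """Rescore an N-best list: combine each hypothesis's HMM
        log-likelihood with the sum of segment-level SVM log-posteriors
        for the hypothesized labels, then re-rank by the combined score.
        Each hypothesis is a dict with an 'hmm_score' and a list of
        (segment_features, hypothesized_label) pairs in 'segments'."""
        rescored = []
        for hyp in nbest:
            svm_score = sum(np.log(svm_scorer(feats, label))
                            for feats, label in hyp["segments"])
            total = (1.0 - weight) * hyp["hmm_score"] + weight * svm_score
            rescored.append((total, hyp))
        # Best combined score first.
        rescored.sort(key=lambda pair: pair[0], reverse=True)
        return [hyp for total, hyp in rescored]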