General comments to all reviewers:

We appreciate the excellent feedback. Please note that the paper, as it
was originally submitted, was already at the transactions page limit.
Hence, we have tried to thoroughly address all the comments below
without increasing the overall length of the paper. This work is
described in detail in Ganapathiraju's dissertation, which is referenced
heavily in the paper. The dissertation is available on our web site to
facilitate access:

http://www.isip.msstate.edu/publications/books/msstate_theses/2002/support_vectors/thesis/thesis_final.pdf

Reviewer No. 1: DETAILED COMMENTS
---------------------------------

> 1) Good tutorial on the evolution of pattern recognition
> techniques as applied to speech recognition. Good description
> of the SVM and reasons for its use. The latter may not be well
> known in the signal processing community.

This is one reason we are interested in publishing this work in this
special issue. In fact, since we started this work in 1997 and
demonstrated the first LVCSR applications in 1998, SVMs have found their
way into many speech applications. Most recently, results on speaker
verification have been demonstrated by several sites. We are currently
working on similar applications.

> 2) Inadequate explanation and justification for eq. (13). This
> only gives local information about the observable pdfs near the
> decision boundaries.

We do not have enough space to show the experimental fits to the data.
We have added more explanation in the first paragraph of the section
titled "Posterior Estimation", including references to Tipping's and
Kwok's work, which we followed, and which address this issue in greater
detail. Also, Platt's work, which is referenced in the last paragraph of
this section, addresses issues in using cross-validation methods to
estimate these parameters.

> 3) Inadequate explanation of the overall recognition system. I
> assume that it's a standard IBM style model but that must be
> made clear and explicit.

In the first sentence of section 2, we reference statistical methods as
described by Jelinek, and explicitly state that these are what we use
throughout this work. The first two paragraphs of this section briefly
review the standard statistical approach to speech recognition.

> 4) Inadequate evaluation of the results. How can we conclude
> that there is a performance improvement. Table 1 is not
> enough.

Table 1 demonstrates the efficacy of the basic classifier on speech-like
data. This was a significant historical step forward, which motivated us
to do more work. Table 2 also describes improvements. We have added
additional discussion and reworked the explanation of Table 2.
Hopefully, the significance of Table 2 is now clearer. Note that we are
not claiming dramatic improvements in performance; modest gains are
quite natural for a first attempt at improving a speech recognition
system. Row 3 of Table 2 is particularly important, since it suggests
that there could be significant improvements in performance IF we solve
the supervised training problem. This will be the topic of a future
paper.

> 5) Inadequate discussion of the pitfalls of the SVM when
> projecting into very high dimensional feature spaces.

This is a good point. Due to length constraints, we had to drop many of
these issues. We have published separate work on this in:
A. Ganapathiraju and J. Picone, "Support Vector Machines For Automatic
Data Cleanup," Proceedings of the International Conference on Spoken
Language Processing, vol. 4, pp. 210-213, Beijing, China, October 2000.
Overcoming some of the practical problems associated with training SVMs
was a major contribution. Again, the details are described in
Ganapathiraju's dissertation and referenced in this paper.
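To make our response to comment 2 above more concrete, the short sketch
below shows the kind of sigmoid fit, in the spirit of Platt's approach,
that underlies the revised "Posterior Estimation" section: the
parameters A and B of a sigmoid mapping from SVM distances to posterior
estimates are obtained by maximizing the likelihood of held-out data.
The function names, optimizer, and starting values are illustrative
choices for this example, not the exact implementation in our system.

    import numpy as np
    from scipy.optimize import minimize

    def fit_sigmoid(distances, labels):
        """Fit p(in-class | f) = 1 / (1 + exp(A*f + B)) to SVM outputs f
        by maximizing the log-likelihood on a held-out set."""
        targets = (labels + 1) / 2.0         # map {-1,+1} labels to {0,1}

        def neg_log_likelihood(params):
            A, B = params
            p = 1.0 / (1.0 + np.exp(A * distances + B))
            eps = 1e-12                      # guard against log(0)
            return -np.sum(targets * np.log(p + eps) +
                           (1.0 - targets) * np.log(1.0 - p + eps))

        result = minimize(neg_log_likelihood, x0=[-1.0, 0.0],
                          method="Nelder-Mead")
        return result.x                      # the fitted (A, B)

    # Usage (illustrative): distances are signed SVM outputs on held-out
    # data, labels are the corresponding +/-1 class labels.
    # A, B = fit_sigmoid(distances, labels)
    # posterior = 1.0 / (1.0 + np.exp(A * f_new + B))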
Reviewer No. 2: DETAILED COMMENTS
---------------------------------

> p. 1. "Simplistic techniques...": this derogates early work
> in recognition and is an example of what I call a cavalier or
> immodest writing style. I suggest "simple techniques" instead.

Corrected.

> "relatively fragile" --> relative to what? Drop "relative", and
> be more specific in describing system fragility.

We have added more explanation and a reference.

> "we demonstrate this" --> instead "that" between last 2 words.

Corrected.

> p. 2-4. A bit too much review of HMM/EM fundamentals. Condense.

We have shortened this section by one page. We could drop the discussion
of ANNs, but several people have argued about this point when we have
presented this work. Therefore, we think it is important to cite the ANN
material in this way.

> p. 3.
>
> "based as an optimization" --> "based ON an optimization"?
>
> "in this example any amount of effort ... will not" --> "in this
> example NO amount of effort".

Fixed.

> p. 4. "The primary different between ML-based HMM... the wrong
> model is used". This is a good point, but the phrasing is confusing.
> Perhaps something like "the objective criterion for the latter
> reflects classification performance even if the wrong probabilistic
> model is used"...?

This paragraph was deleted in an effort to reduce the size of this
section.

> p. 5-8: This overview section seems slightly long.
>
> p. 5. "Closed-loop optimality". I assume this refers to performance
> on the training set. I don't believe that "closed-loop" is a common
> term in machine learning or speech recognition.
>
> "the design of a classifier is essentially a process" --> "the design
> of a classifier is essentially THE process"
>
> p. 6, first paragraph introduces the term "loss function", but then
> this is re-introduced and italicized further down the page.
> Italicize/emphasize the term at its first use.
>
> p. 8-10. This section summarizes SVMs, but this is all review
> material. Suggest condensing.

Pages 5-10 have been condensed to only the essentials. We are concerned
that this reduces the tutorial value of the paper.

> p. 11. On page 11 of a paper whose conclusion starts on page 16,
> finally we're starting to hear details of what the authors did
> in this study. I suggest adding one or two lines describing
> in more detail the ML-approach to estimating A and B of
> the sigmoid.
>
> "expense of increased computations" --> "expense of increased
> computation time"

Extensive modifications were made to this description, including a
figure showing the results. These changes should make it much easier for
the reader to understand the details of this work. We have also included
references to the definitive sources for this approach.

> p. 12. "the phone 's'" --> "the phone /s/". However, the point
> that the authors are trying to make is independent of the
> phone identity, so I suggest not introducing the phone identity.

Good point - fixed.

> p. 13. What do the authors mean exactly by "we use segments composed
> of the three sections"? I assume three equal-dimension vectors
> are concatenated. What is the dimensionality of these vectors?
> How are they themselves derived? The following text mentions
> LPC (log-area parameters), but it's not clear if this is how the
> segments were calculated.

We added a paragraph describing this in more detail, and an associated
figure.
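As an aside that may help the reviewer visualize the new paragraph and
figure, the sketch below shows the general form of the segment-level
vector construction: the frames of a variable-length segment are divided
into three regions, the frame-level features in each region are
averaged, and the three region means are concatenated into one
fixed-length vector. The equal-length regions and the 13-dimensional
frame features in the usage note are illustrative assumptions, not
necessarily the exact configuration in our system.

    import numpy as np

    def segment_vector(frames):
        """Map a variable-length segment (num_frames x num_features) of
        frame-level features to a fixed-length vector: split the segment
        into three regions, average the frames in each region, and
        concatenate the three region means."""
        regions = np.array_split(frames, 3)   # three roughly equal regions
        return np.concatenate([r.mean(axis=0) for r in regions])

    # Usage (illustrative): for 13-dimensional frame features, the result
    # is a 39-dimensional vector regardless of the number of frames.
    # x = segment_vector(np.random.randn(27, 13))   # x.shape == (39,)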
> p. 14. "This was a good database" --> "This is a good database"

Corrected.

> "We have observed similar trends on a number of static
> classification tasks": details (or references) please.

Let us first explain the source of this comment, and then how we
addressed it in the paper. We have used a number of well-known pattern
recognition data sets in our pattern recognition course, and found that
SVMs work very well on these. In fact, we developed a Java applet that
demonstrates some of this. Obviously, such material cannot be published,
so we do not have conference or journal publications to point to. Since
this paper was written, several results have recently been published on
different speech tasks. We think it is important to show this to
readers, so they know our results are not anomalous. We have added a URL
reference to the pattern recognition applet, which should be useful to
people who want to explore static classification tasks. We have also
added a reference to a paper published at Eurospeech 2003.

> p. 16. "When we allow the SVM to decide the best segmentation and
> hypothesis combination by using the N-best segmentations..."
> This point regarding the segmentations, either decided by the
> baseline HMM system or by the SVMs, is very unclear. A lot more
> detail is needed on the authors' proposed technique. The authors have
> provided several figures describing the fundamentals of SVMs, but no
> figures describing their own recognition system architecture.
> Is their method essentially an N-best rescoring method? Will they
> always require a baseline HMM to provide N-best lists, or
> could the SVMs be used to supplant the HMM?

We added a figure at the end of section 5 that summarizes the overall
system architecture. We have also modified the discussion in section 6
to better explain the difference between using reference segmentations
and hypothesized segmentations. We have also tried to be very clear that
in this initial work we are using a hybrid approach. This is sufficient
for proof of concept, but obviously not acceptable for mature
technology. Subsequent research will be published on more integrated
approaches. This is mentioned briefly in the summary.

> "Table 1" -- this should be Table 2, I think.

Corrected.

> I strongly urge the authors to give slightly less details
> about review material that is already published, and many
> more details on what they themselves did in this study.

We appreciate this reviewer's detailed comments and have tried to
accommodate them. Our only concern is that this review material is still
not well known throughout the signal processing community, particularly
the speech community, and needs some forum for publication. Ironically,
the SP Magazine rejected our request to publish an extensive review
paper on risk minimization, and recommended we pursue publication in the
Signal Processing transactions.

Reviewer No. 3: DETAILED COMMENTS
---------------------------------

> This paper proposes two deep, theoretically interesting research
> topics, then drops both of them without warning. In the end,
> the experiments performed by the authors seem to be nothing more
> than standard SVM classification experiments, performed almost
> step by step according to the instructions in Vapnik's textbook.
We hope this reviewer will appreciate that the application of these
techniques to speech recognition is not trivial. We have been upfront
about the scope of this paper - it presents the first application of
SVMs to LVCSR tasks. We are not claiming innovations with respect to the
core SVM theory.

> Section 2 discusses the well-known problems with ML, and
> suggests MCE as a solution. Section 3 discusses the principle
> of structural risk minimization. Combination of these concepts
> would be very interesting, as MCE is itself a type of ERM;
> unfortunately, both of these concepts are dropped without
> warning, and the paper never returns to them.

We are currently working on a comparison of MMIE, MCE, and MLLR with
approaches such as SVMs and RVMs. Implementation of effective MMIE and
MCE approaches in a framework that is compatible with our SVM and RVM
systems is not trivial. We hope this will be the subject of a future
journal paper. However, we believe history supports our contention that
papers must be published on the efficacy of the basic SVM approach on
speech problems before it can be compared to mature competing
discriminative training technologies. We would like to publish at least
one journal paper that clearly describes the SVM approach in a speech
recognition application, since there are many choices one can make for
each component. In fact, our system, when it was first publicly
disclosed, was fairly unique in its combination of components.

> Section 4 presents the support vector machine, and section 5
> proposes an SVM/HMM hybrid. An SVM/HMM hybrid would be very
> interesting. ANN/HMM hybrids use the classification power of an
> ANN in order to improve the observation PDF computation of an
> HMM. As a result, the hybrid systems often outperform mixture
> Gaussian HMMs, at the cost of substantially increased
> computational complexity during training. An SVM/HMM hybrid,
> using SVM to compute the observation PDFs of an HMM in some way,
> would presumably do even better than the ANN/HMM hybrid,
> presumably at even greater computational cost. Unfortunately,
> this idea is dropped immediately after being suggested, because
> it is too computationally complex.

Our system is a hybrid SVM/HMM approach. We made some changes, as
described above, to make this clearer. We do not have an equivalent
ANN/HMM hybrid to compare to, since we do not have ANN technology
in-house. Building a state-of-the-art ANN system is a large task in
itself. We hope someone else who is expert in this technology can
publish a comparison to our system on comparable tasks.

> In the end, the authors perform n-ary classification of
> fixed-length segmental feature vectors. As far as I can tell,
> the only interaction between the HMM system and the SVM system
> is that the HMM is used to find segment boundaries for the SVM
> to classify; the final log-probabilities of the HMM and the SVM
> are then apparently added together in order to create the final
> recognition score. In my view, adding together the scores of
> two classifiers, both of which are trained and tested according
> to standard published references, does not constitute a "hybrid"
> system of sufficient novelty to merit publication.

We respectfully disagree, though we support the reviewer's efforts to
maintain a high threshold for publication. One could argue that almost
any innovation in speech recognition, such as the use of HMMs, fits the
above description for lack of novelty. In creating an SVM system, there
were many issues that had to be explored.
Ganapathiraju's dissertation goes into much more detail on many things
that were tried and DID NOT work. For example, we never could get Fisher
kernels to work well. We think it is important that people learn how to
combine various component technologies into a complete system. There was
also some work on score combination between SVMs and HMMs that we did
not discuss in this paper, but which is covered in the dissertation. You
might find that interesting.
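Finally, to make the hybrid rescoring discussed in the responses above
(and in the response to Reviewer No. 2's p. 16 comment) a bit more
concrete, the sketch below shows one way N-best hypotheses from a
baseline HMM can be rescored by combining the HMM log-likelihood with
segment-level SVM scores and re-ranking. The data structures, the
interpolation weight, and the svm_scorer function are illustrative
assumptions for this example, not the exact implementation in our
system.

    import numpy as np

    def rescore_nbest(nbest, svm_scorer, weight=0.5):
        """Rescore an N-best list: combine each hypothesis's HMM
        log-likelihood with the sum of segment-level SVM log-posteriors
        for the hypothesized labels, then re-rank by the combined score.
        Each hypothesis is a dict with an 'hmm_score' and a list of
        (segment_features, hypothesized_label) pairs in 'segments'."""
        rescored = []
        for hyp in nbest:
            svm_score = sum(np.log(svm_scorer(feats, label))
                            for feats, label in hyp["segments"])
            total = (1.0 - weight) * hyp["hmm_score"] + weight * svm_score
            rescored.append((total, hyp))
        # Best combined score first.
        rescored.sort(key=lambda pair: pair[0], reverse=True)
        return [hyp for total, hyp in rescored]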