Title: Support Vector Machines for Automatic Data Cleanup
Authors: Aravind Ganapathiraju, Joseph Picone

Support Vector Machines (SVMs) are a class of machine learning techniques that learn to classify discriminatively. This paradigm has gained prominence in the past few years with the development of efficient training algorithms. The estimation of the parameters of an SVM is posed as a quadratic optimization problem. The result is a set of Lagrange multipliers, one for each training data point. Only a small percentage of the training vectors have a corresponding non-zero multiplier; these data points are called support vectors. Though a solution to this quadratic optimization is guaranteed, the number of required computations can be very high, depending on the separability of the data and the number of training data points. Several heuristics must be considered to make SVM training possible in a reasonable amount of time. One of the most common ways to tackle this problem is to divide the optimization into sub-problems whose solutions can be found easily. "Chunking" is based on this paradigm: the data is divided into chunks, and the functional is optimized for each chunk. It has been proved that this algorithm does indeed give the same solution as a global optimization process, but with much less operating memory and time. The algorithm proceeds as follows:

1) choose a chunk of training points (the working set)
2) solve the optimization problem defined by the points in the working set
3) continue until there is no further change in the functional's value

The trick is to find a working set (chunk) that helps the optimization converge quickly. Each new chunk of data is chosen so that it contains the points that violate the optimality constraints the most. For each chunk of data that is optimized, there will be support vectors with their multipliers at the upper bound, which indicates that the support vector lies in a region where the classes overlap.
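As an illustration of the dual problem described above, the following sketch solves the SVM quadratic program directly on a tiny toy dataset (the chunking procedure solves the same QP restricted to a working set). This is a minimal sketch using SciPy's SLSQP solver, not the algorithm of the paper; the data and variable names are our own:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable 2-D data, labels in {-1, +1}
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
n, C = len(X), 10.0          # C is the upper bound on each multiplier

# Dual Hessian: Q_ij = y_i * y_j * <x_i, x_j>
Q = (y[:, None] * y[None, :]) * (X @ X.T)

def neg_dual(a):
    # Negative of the dual objective: maximize sum(a) - 0.5 a'Qa
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_dual, np.zeros(n), method="SLSQP",
               bounds=[(0.0, C)] * n,
               constraints={"type": "eq", "fun": lambda a: a @ y})
alphas = res.x

# Only a few multipliers are non-zero: those points are the support vectors
support = np.where(alphas > 1e-4)[0]
```

For separable data like this, only the points nearest the opposite class receive non-zero multipliers; the rest of the training set plays no role in defining the classification surface.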
These multipliers very often remain at the upper bound. When this happens over several iterations of the optimization process, it is a good indication that the data point with the bounded multiplier lies in the overlap region. The overlap can suggest one of two things: either there is inherent overlap in the data, or the data point is mislabeled. Either way, it is better not to include this data point in the definition of the classification surface.

This unique property of the SVM optimization process can be used effectively in speech recognition in several ways. One application is data cleanup. Several databases, especially databases of conversational speech, come with an inherent transcription word error rate that significantly degrades training efficacy. The capability to identify mislabeled data in such an environment can be of tremendous help in training acoustic models effectively. Another area where this property can be applied is confidence measures. Several speech recognizers use confidence measures to guide the training and recognition processes. We can devise methods to quantify the degree of mislabeling based on this property and use that as a confidence measure.

Preliminary experiments on highly confusable phone pairs in OGI Alphadigits indicate that SVMs do a very good job of identifying mislabeled data and are very consistent. We are in the process of using this feature of SVMs to identify mislabeled data at the word level, which translates to identifying transcription errors in databases.
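The cleanup idea can be sketched with a standard SVM implementation: train on data containing a deliberately flipped label, then flag the points whose multipliers sit at the upper bound C. This is a minimal sketch assuming scikit-learn's SVC (where `dual_coef_` holds the signed multipliers of the support vectors); it is not the paper's experimental setup:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated Gaussian clusters, labels in {-1, +1}
X = np.vstack([rng.normal(-2.0, 0.3, (40, 2)),
               rng.normal(2.0, 0.3, (40, 2))])
y = np.array([-1] * 40 + [1] * 40)
y[0] = 1                     # deliberately mislabel one point

clf = SVC(kernel="linear", C=1.0).fit(X, y)
alphas = np.abs(clf.dual_coef_).ravel()   # |alpha_i * y_i| per support vector

# Multipliers stuck at the upper bound C mark points in the overlap region;
# in clean, well-separated data these are candidates for mislabeling
at_bound = clf.support_[np.isclose(alphas, clf.C)]
```

The flipped point sits deep inside the wrong class, so its KKT conditions force its multiplier to the bound, and it appears in `at_bound`; flagging such points is the mechanism the abstract proposes for transcription cleanup.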