SCENIC BEAUTY ESTIMATION USING INDEPENDENT COMPONENT ANALYSIS AND SUPPORT VECTOR MACHINES

X. Zhang, V. Ramani, Z. Long, and Y. Zeng
Image Group, Department of Electrical and Computer Engineering
Mississippi State University, Mississippi State, Mississippi 39762
{zhang, ramani, long, zeng}@isip.msstate.edu

ABSTRACT

The main focus of this project is to apply Independent Component Analysis (ICA) and Support Vector Machine (SVM) class modeling techniques to the problem of estimating the scenic beauty of forestry images. This work originated from the United States Forest Service's (USFS's) need to automatically produce forest management decisions that balance timber needs with the public's desire to preserve natural beauty. Data modeling techniques such as Principal Component Analysis (PCA) have been applied to this problem with limited success. We believe ICA is a more appropriate technique for the general classification problem because it finds a linear coordinate system for the data in which the transformation is free to be non-orthogonal. SVMs are also promising because they perform well when the classes are inherently separated by a nonlinear decision surface, and human subjective decisions are highly nonlinear. In this work, both ICA and SVMs are tested on a standard subset of 637 images and are used to classify each image as having high, medium, or low scenic beauty. The classification results are compared with human subjective ratings on a number of disjoint test sets.

1. Introduction

The United States Forest Service has initiated an effort to develop automatic methods for managing forest resources. These methods use forestry images to determine the utility of a plot of forest land both in terms of timber use and scenic quality. Several data modeling algorithms have been applied to the scenic beauty estimation of forestry images, including Principal Component Analysis (PCA) and decision trees. Each has found only limited success on this task. In this paper, we explore the use of two promising new techniques, Independent Component Analysis (ICA) and Support Vector Machines (SVMs).

Independent Component Analysis [5] is the name for a family of techniques which in recent years have produced interesting results on multi-source and single-source audio data, natural images, and other domains. It is a simple, general-purpose method for transforming ensembles of multivariate data into "natural" coordinate systems. When the data is transformed by the ICA transformation, the resulting output variables are as statistically independent as possible. This is a considerable improvement over PCA, which considers only orthogonal mappings. Statistical independence is particularly important in the USFS problem, where the features are known to have a high degree of overlap.

A Support Vector Machine (SVM) [9] is a machine learning technique which has been effective in many pattern matching applications such as face recognition and phone classification. With SVMs, input vectors are mapped into a higher-dimensional feature space through a nonlinear mapping. In this space a linear classification decision region is constructed. This decision region, when mapped back into the original feature space, can take a nonlinear form.
The primary advantage of this technique is that it can model highly nonlinear data, and special properties of the decision surface ensure good generalization. In our work we use the public-domain SVMlight package, which provides access to many different SVM kernels.

A total of 45 features are extracted from the images, including color content, density of long lines, entropy and fractal dimension. We examined many combinations of features to find the one which matches the human subjective ratings best and thus gives the lowest error rates. The best error rates achieved using ICA and SVMs are 33.44% and 32.2%, respectively. These results are superior to all previously reported results on this problem. Encouraged by these results, we believe that we can combine the two techniques to produce a better classification scheme. By using ICA as a front end to SVMs, we will be able to supply optimally separated distributions to the SVMs, thereby constructing a better nonlinear decision region. We expect that the classification capability of the combined system will be a great improvement over using either technique independently.

2. Theory

2.1. Principal Component Analysis

Principal component analysis (PCA) [2] is commonly used for data compression by dimension reduction. The first principal component of a sample vector is the direction along which the variance over all samples is largest; this direction corresponds to the eigenvector associated with the largest eigenvalue of the covariance matrix. The kth principal component is chosen to be the linear combination of the input features that has the largest variance, under the constraint that it is also uncorrelated with the previous k-1 principal components. This approach is well suited for data compression, where the objective is to transmit data with minimal distortion. It also finds application in pattern classification, where the assumption is that the direction of maximum variation is also most likely to contain information about class discrimination.

The method of principal component analysis is, however, sensitive to feature scaling. Multiplying the nth component of the input feature vector by some constant cannot change the information that component carries about the class of the input vector, but it inflates that component's variance, so the leading principal component begins to point in the direction of the nth feature axis even though that direction may contain no class-discrimination information.

2.2. Linear Discriminant Analysis

Linear discriminant analysis [7] easily handles the case where the within-class frequencies are unequal. This method maximizes the ratio of the between-class variance to the within-class variance in any particular data set, thereby guaranteeing maximal separability. Linear discriminant analysis is applied to such problems as image classification and speech recognition. The prime difference between LDA and PCA is that PCA operates on the features alone, while LDA is driven by the class membership of the data. LDA performs better than PCA in cases where the data from two classes overlap or the features are highly correlated. In PCA, the shape and location of the original data sets change when they are transformed to a different space, whereas LDA does not change the location but only tries to provide more class separability and to draw a decision region between the given classes. This method also helps to obtain a better understanding of the distribution of the feature data.
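To make the contrast concrete, the following is a minimal sketch (not part of the original study, which used its own implementations) comparing PCA and LDA projections on synthetic two-class data; the toy data, its dimensionality, and the use of scikit-learn are illustrative assumptions.

```python
# Illustrative comparison of PCA and LDA on synthetic two-class data
# (a hypothetical stand-in for the 45-dimensional USFS feature vectors).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Two overlapping Gaussian classes in a 5-dimensional feature space.
X_a = rng.normal(loc=0.0, scale=1.0, size=(200, 5))
X_b = rng.normal(loc=0.7, scale=1.0, size=(200, 5))
X = np.vstack([X_a, X_b])
y = np.array([0] * 200 + [1] * 200)

# PCA finds orthogonal directions of maximum variance,
# ignoring the class labels entirely.
pca = PCA(n_components=2).fit(X)
X_pca = pca.transform(X)

# LDA finds the direction that maximizes between-class variance
# relative to within-class variance, using the labels.
lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
X_lda = lda.transform(X)

print("PCA explained variance ratios:", pca.explained_variance_ratio_)
print("LDA projection shape:", X_lda.shape)
```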
2.3. Decision Trees

A decision tree classifier [7] splits the N-dimensional feature space into unique regions using a sequential method. The algorithm is constructed so that every class need not be tested to arrive at a decision, which is advantageous when the number of classes is very large. Moreover, unlike many other training algorithms, this algorithm is guaranteed to converge whether or not the feature space is linearly separable.

The objective of a symbolic decision-learning system is to provide an inference mechanism for classifying all events, that is, to generalize the partial knowledge represented in the initial database of events. In other words, the procedure partitions the event space according to the different decisions. Ideally, this should be an equivalence partition with large blocks, which increases the comprehensibility of the generated knowledge and usually improves generalization. In general, however, the generated partition may be neither complete nor consistent.

Decision tree classifier systems combine two components: one to build the tree and another for knowledge inference. Each training instance contains a set of features and the associated classification. Each internal node is labeled with an attribute and has outgoing edges corresponding to the domain values of that attribute, and each node in the tree contains the subset of the training events that satisfies the conditions leading to that node.

2.4. Independent Component Analysis

As mentioned previously, independent component analysis stands for a family of techniques. Like principal component analysis, it finds a linear coordinate system for the data, transforming it by a matrix $W$ into a new basis set. Unlike PCA, however, the transformation is free to be non-orthogonal. Furthermore, the axes are sensitive to statistics of all orders, whereas in PCA only the second-order statistics of the covariance matrix are used. When the data is transformed by $W$, the resulting output variables are as statistically independent as possible.

2.4.1 Algorithm

The specific ICA algorithm [5] used in this project proceeds as follows. Assume we have a training set of feature vectors $x$. We wish to find a transformation $W$ with the properties mentioned above. First, the data is prewhitened by $x' = W_z x$, where $W_z = 2 \langle x x^T \rangle^{-1/2}$. The matrix $W$ is then initialized to the identity matrix and trained using the logistic-function version of the infomax learning rule,

    \Delta W \propto [ I + (1 - 2y) u^T ] W,

in which $u = W x'$ and each component of $y$ evaluates as $y_i = 1 / (1 + e^{-u_i})$. After enough iterations of training we obtain the final matrix $W$.
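The following is a minimal sketch of this logistic infomax rule from [5] applied to whitened toy data; the learning rate, iteration count, and Laplacian toy sources are assumptions made purely for illustration, not the settings used in this project.

```python
# Minimal sketch of the logistic infomax ICA rule of [5]
# (natural-gradient form) on prewhitened data.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_samples = 5, 1000
X = rng.laplace(size=(n_features, n_samples))       # columns are samples

# Prewhitening: W_z = 2 * <x x^T>^{-1/2}
cov = X @ X.T / n_samples
eigval, eigvec = np.linalg.eigh(cov)
W_z = 2.0 * eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
X_w = W_z @ X

W = np.eye(n_features)                               # initialize to identity
lr = 0.001
for _ in range(200):
    U = W @ X_w                                      # u = W x'
    Y = 1.0 / (1.0 + np.exp(-U))                     # logistic nonlinearity
    # Infomax update: dW ~ (I + (1 - 2y) u^T) W, averaged over the samples.
    dW = (np.eye(n_features) + (1.0 - 2.0 * Y) @ U.T / n_samples) @ W
    W += lr * dW

S = W @ X_w                                          # estimated independent components
print("Recovered component shape:", S.shape)
```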
2.5. Support Vector Machines

The support vector machine (SVM) is a relatively new technique for classification problems. The main objective of this technique is to construct an optimal hyperplane, which uses only a small part of the training set as support vectors. If the training vectors are separated without errors by an optimal hyperplane, the expected error rate on a test sample is bounded by the ratio of the expected number of support vectors to the number of training vectors. Since this ratio is independent of the dimension of the problem, if we can find a small set of support vectors, good generalization is guaranteed.

2.5.1 Basics

First, consider the simplest case: linear machines trained on separable data from two classes. Label the training data $\{x_i, y_i\}$, $i = 1, \ldots, l$, with labels $y_i \in \{-1, +1\}$ and feature vectors $x_i \in R^d$. We would like to find a separating hyperplane $w \cdot x + b = 0$, where $w$ is normal to the hyperplane; this hyperplane separates the positive examples from the negative ones. Let $d_+$ ($d_-$) be the shortest distance from the separating hyperplane to the closest positive (negative) example; then $d_+ + d_-$ is called the "margin" of the separating hyperplane. The support vector machine simply looks for the separating hyperplane with the largest margin. This is shown in [9] to be equivalent to the following problem. Given the inequalities

    y_i (x_i \cdot w + b) - 1 \geq 0  for all i,                           (1)

find the pair of hyperplanes

    H1: x_i \cdot w + b = +1,    H2: x_i \cdot w + b = -1

which give the maximum margin by minimizing $\|w\|^2$, subject to constraints (1). Reformulating the problem with the Lagrangian method, we have the Lagrangian

    L_P = (1/2) \|w\|^2 - \sum_i \alpha_i y_i (x_i \cdot w + b) + \sum_i \alpha_i.   (2)

It is further explained in [8] that minimizing (2) is the same as maximizing the dual

    L_D = \sum_i \alpha_i - (1/2) \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j).   (3)

Therefore support vector training amounts to maximizing $L_D$ with respect to the $\alpha_i$, subject to the constraint $\sum_i \alpha_i y_i = 0$ and positivity of the $\alpha_i$, with the solution given by

    w = \sum_i \alpha_i y_i x_i.                                           (4)

In the solution, those points for which $\alpha_i > 0$ are called "support vectors" and lie on one of the hyperplanes H1, H2. For these machines, the support vectors are the critical elements of the training set: they lie closest to the decision boundary, and removing all other training points would not change the separating hyperplane. Once a support vector machine has been trained, we simply determine on which side of the decision boundary a test vector lies and classify it into the corresponding class.

2.5.2 Non-separable case

For non-separable data, the constraints of (1) are relaxed by introducing slack variables $\xi_i$:

    x_i \cdot w + b \geq +1 - \xi_i   for y_i = +1,                        (5)
    x_i \cdot w + b \leq -1 + \xi_i   for y_i = -1,                        (6)
    \xi_i \geq 0  for all i,                                               (7)

and the solution still has the form of (4).

2.5.3 Nonlinear support vector machines

The SVM algorithm above, formulated for linear decision surfaces, can be extended to nonlinear data. The general idea is to map the data into another space in which a linear separation can still be performed on the mapped data. To perform the mapping, we need to find a suitable "kernel function". A detailed discussion of these nonlinear problems can be found in [9].
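As a hedged illustration of the soft-margin, kernel-based formulation above, the sketch below trains an RBF-kernel SVM on toy data using scikit-learn's SVC as a stand-in for the SVMlight package used in this work; the data and kernel parameters are assumptions, not the settings used in our experiments.

```python
# Illustrative soft-margin SVM with an RBF kernel (scikit-learn's SVC
# used here as a stand-in for SVMlight).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two classes that are not linearly separable in the input space.
X_pos = rng.normal(0.0, 1.0, size=(100, 2))
X_neg = rng.normal(0.0, 1.0, size=(100, 2)) * 3.0
X = np.vstack([X_pos, X_neg])
y = np.array([+1] * 100 + [-1] * 100)

# C penalizes the slack variables xi_i of the non-separable formulation;
# the RBF kernel performs the implicit nonlinear mapping of Section 2.5.3.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print("Support vectors per class:", clf.n_support_)
print("Training accuracy:", clf.score(X, y))
```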
3. Database

The database used in this work [1] is based on two sets of images taken approximately four years apart. The first set consists of photographs taken during 1990-91; the second set consists of photographs taken during 1994-95. The images included in the database were taken from a study spanning four years in the Ouachita Forest in Arkansas. Photographs taken under controlled conditions were digitized using an extremely high quality scanning process and converted into computer-readable data.

The area under study was partitioned into four blocks, with each block subdivided into five plots. The database includes images of each plot during each of the four seasons of the year, photographed in both the 1990-91 and 1994-95 periods and from four different angles. Thus the total number of images in the database is 4 blocks x 5 plots x 4 seasons x 4 angles x 2 years = 640. In addition, 40 images (20 from the 1990-91 period and 20 from the 1994-95 period) were added as baseline images, and a set of 20 images was included as warm-up slides, used to calibrate human performance during subjective testing.

The images were originally delivered in a proprietary format developed by Kodak known as PhotoCD (PCD), a format for archiving high-quality photographs. Due to the proprietary nature of the PCD format, the images were converted into the Portable Pixel Map (PPM) format [1]. The PCD-to-PPM conversion was done using version 0.6 of pcdtoppm on a Unix machine. The PPM images were generated from the 4x PCD resolution, giving images 1536 pixels wide by 1024 pixels high. Each image requires about 5 Mbytes of disk space. The PPM P6 format represents each pixel as a sequence of three bytes: the first byte is the red value encoded on a linear scale of 0 to 255, and the second and third bytes are the green and blue values. Each file also includes information about the image, such as the photographic tray and slide number, the human subjective judgment on a scale of 1 to 10, the Scenic Beauty Rating (SBE), the Normalized Scenic Beauty Rating (SBEZ), the block number, the treatment, the plot number, the date, the angle, and other miscellaneous information [1]. In this project, we refer to the images by three general classes: low scenic beauty estimate (LSBE), medium scenic beauty estimate (MSBE), and high scenic beauty estimate (HSBE).

4. Experiments

In this project, both ICA and SVMs were tested on a standard subset of 637 images. There are three classes of images in this subset: high scenic beauty, medium scenic beauty and low scenic beauty. The images are further divided into four groups, each consisting of one training set and one testing set. The training set is used to learn the characteristics of the different classes of images with ICA or SVMs, while the testing set is used to measure how well the algorithms have learned, as indicated by the resulting error rates.

A total of 45 features are extracted from the images, including color content, density of long lines, entropy and fractal dimension. We examined the following combinations of features:

1) all 45 features (ALL)
2) Red + Green + Blue (RGB)
3) Red + Green + Blue + Long Line (RGB + LL)
4) Red + Green + Blue + Long Line + Entropy (RGB + LL + ENT)
5) Red + Green + Blue + Long Line + Fractal Dimension (RGB + LL + FRACT)

For each combination of features, we trained and tested with ICA or SVMs on all four groups of data sets and computed statistics over these experiments. By comparing the results generated from the different feature combinations, we can identify features which are more correlated with the human subjective standards of scenic beauty than the others.

The implementation of ICA was based on information maximization theory: if the output entropy is maximized, then the components of the output vector must be statistically independent. We obtained the transformation matrices for the hsbe, msbe and lsbe classes separately and transformed the training data into each new basis set, after which the class means and covariance matrices were calculated. Each test vector was likewise transformed into the three basis sets and its distance to each class was calculated; by comparing these distances, the test images were classified.

To apply SVMs to this three-class problem, we built three classifiers using the training data for the hsbe, msbe and lsbe classes respectively. For a test image, we calculate the distance of its feature vector to each of the three classifiers, and by comparing these distances we determine which class the test image belongs to. For the SVMs we used the public-domain SVMlight package. In the experiments, we chose different kernel functions and various parameters for those functions to find the best classification results.
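The per-class ICA decision rule described above can be sketched as follows. Since the text does not specify the distance measure, a Mahalanobis-style distance to the class mean under the class covariance is assumed here purely for illustration, and the models and test vector are hypothetical placeholders.

```python
# Sketch of a per-class decision rule: transform a test vector into each
# class's ICA basis and pick the class with the smallest distance.
# The Mahalanobis distance is an assumption for illustration only.
import numpy as np

def class_distance(x, W, mean, cov):
    """Distance of feature vector x to one class, in that class's ICA basis."""
    u = W @ x
    diff = u - mean
    return float(diff @ np.linalg.solve(cov, diff))

def classify(x, models):
    """models maps 'hsbe'/'msbe'/'lsbe' to (W, mean, cov) tuples."""
    distances = {label: class_distance(x, W, mean, cov)
                 for label, (W, mean, cov) in models.items()}
    return min(distances, key=distances.get)

# Hypothetical 45-dimensional class models and test vector.
rng = np.random.default_rng(0)
models = {label: (rng.normal(size=(45, 45)),   # ICA transform W
                  rng.normal(size=45),          # class mean in ICA space
                  np.eye(45))                   # class covariance
          for label in ("hsbe", "msbe", "lsbe")}
x_test = rng.normal(size=45)
print("Predicted class:", classify(x_test, models))
```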
5. Results

5.1. ICA

5.1.1 Error Rates

The error rates (in percent) for all of the feature combinations are listed in the following table.

Features        Set 1   Set 2   Set 3   Set 4   Mean    Variance
ALL             35.22   33.54   31.87   33.12   33.44   5.7457
RGB             34.59   33.54   32.50   33.12   33.44   2.3185
RGB+LL          34.59   33.54   33.12   33.12   33.59   1.4443
RGB+LL+ENT      34.59   34.18   33.12   32.50   33.60   2.7569
RGB+LL+FRACT    35.85   33.54   32.50   33.12   33.75   6.4135

5.1.2 Confusion Matrix

The following confusion matrix is for the experiment using RGB+LL+ENT on data set 1, which gave an error rate of 34.9%. Rows are the measured (hypothesized) classes and columns are the human subjective classes.

Measured \ Subjective    hsbe    msbe    lsbe
hsbe                        0       1       2
msbe                       28     102      26
lsbe                        0       0       0

5.2. SVM

5.2.1 Error Rates

The error rates (in percent) for all of the feature combinations are listed in the following table.

Features        Set 1   Set 2   Set 3   Set 4   Mean    Variance
ALL             34.6    34.2    31.2    31.9    33.0    8.4475
RGB             35.2    34.8    32.5    32.5    33.8    6.3300
RGB+LL          32.1    32.3    31.9    32.5    32.2    0.2222
RGB+LL+ENT      35.2    34.2    31.9    32.5    33.4    6.9300
RGB+LL+FRACT    35.2    35.2    32.5    31.2    33.5    12.0675

5.2.2 Confusion Matrix

The following confusion matrix is for the experiment using RGB+LL on data set 1, which gave an error rate of 32.1%. Rows are the measured (hypothesized) classes and columns are the human subjective classes.

Measured \ Subjective    hsbe    msbe    lsbe
hsbe                        3       0       0
msbe                       25     100      23
lsbe                        0       3       5

5.3. Histograms

We also generated histograms of the experimental results for ICA and for SVMs (see the accompanying figures).

6. Discussion

We compared the error rates obtained using ICA, SVMs and PCA; the comparison is shown in the histogram referenced above. For every combination of features that we tested, both ICA and SVMs perform much better than PCA with regard to the error rate. The best error rate achieved by ICA is 33.44%, using the RGB combination, and the best error rate achieved by SVMs is 32.2%, using the RGB+LL features. The best previously reported results on this problem are 36.6% for PCA, 38.47% for LDA, and 43% for decision trees. The results obtained with ICA and SVMs are clearly superior to these earlier ones.

Encouraged by these results, we believe that we can combine the two techniques to produce a better classification scheme. By using ICA as a front end to SVMs, we will be able to supply optimally separated distributions to the SVMs, thereby constructing a better nonlinear decision region. We expect the classification capability of the combined system to be a great improvement over using either technique independently.

7. Evaluation

The performance of the system under the various algorithms was evaluated in a comparative study with the help of a front-end demonstration tool developed in Tcl-Tk [A].

8. Acknowledgements

Sincere thanks to Aravind Ganapathiraju for helping us implement SVMs in the USFS system. Thanks to Suresh Balakrishnama for giving us a real lead into the ICA implementation. Thanks to Jonathan Hamaker, Neeraj Deshmukh and Dr. Joseph Picone for their immense help in making our project a success. Finally, we would also like to thank the students at ISIP for helping us in various aspects of implementing the project.

9. References

[1] N. Kalidindi, A. Le, J. Picone and V. A. Rudis, "The USFS Scenic Beauty Estimation Database," USFS Southern Research Station, Institute for Signal and Information Processing, Mississippi State University, Mississippi State, Mississippi, USA, December 1996.

[2] N. Kalidindi and J. Picone, "Scenic Beauty Estimation of Forestry Images," USFS Project Final Report, Institute for Signal and Information Processing, Mississippi State University, Mississippi State, Mississippi, USA, December 1997.
Rudis, "Scenic Beauty Estimation of Forestry Images," Proceedings of IEEE Southeastcon, pp. 337 - 339, Blacksburg, Virginia, USA, April 1997. [4] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, John Wiley and Sons Inc., New York, New York, USA, 1973. [5] A. J. Bell and T. J. Sejnowski, "An Information-Maximization Approach to Blind Separation and Blind Deconvolution," Technical Report No. INC-9501, Institute for Neural Computing, University of California at San Diego, San Diego, California, USA, February 1995. [6] A. K. Jain, Fundamentals of Digital Image Processing, Prentice-Hall International Inc., Englewood Cliffs, New Jersey, USA, 1989. [7] J. Picone, "Applications of High Performance Statistical Modeling to Image Analysis of Forest Structure," ISIP Proposal No. 03-98, Institute for Signal and Information Processing, Mississippi Sate University, Mississippi State, Mississippi, USA, July 1998. [8] R. Brown, S. Balakrishnama, and J. Picone, "Scenic Beauty Estimation using Linear Discriminant Analysis," MS State ECE 4773 Semester Project, Institute for Signal and Information Processing, Mississippi State University, Mississippi State, Mississippi, USA, December 1997. [9] C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," available at http://svm.research.bell-labs.com/SVMdoc.html, Bell Laboratories, Lucent Technologies, USA, December 1998. [10] V. Vapnik, " The nature of statistical learning theory," Springer Verlag, New York, 1995. [11] V. Vapnik, S. Golowich and A. Smola, "Support Vector Method for Function Approximation, Regression Estimation, and Signal Processing", Advances in Neural Information Processing Systems,1996 .