Principal component analysis and dataset transformation

Bob Fisher

Principal component analysis (PCA) can be used to analyze the structure of a dataset or to represent the data in a lower-dimensional space (among many other applications).

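Let $\{x_i\}$, $i = 1, \dots, N$, be a set of N column vectors of dimension D. Define the scatter matrix of the dataset as

$$S = \sum_{i=1}^{N} (x_i - m)(x_i - m)^\top$$

where $m$ is the mean of the dataset:

$$m = \frac{1}{N} \sum_{i=1}^{N} x_i$$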
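The scatter matrix definition can be sketched in a few lines of NumPy. This is a minimal illustration, not the author's code: the function name `scatter_matrix`, the (D, N) column layout of `X`, and the tiny example dataset are all assumptions made here.

```python
import numpy as np

def scatter_matrix(X):
    """Return (S, m) for data X of shape (D, N), one column per data vector."""
    m = X.mean(axis=1, keepdims=True)   # mean vector, shape (D, 1)
    Xc = X - m                          # subtract the mean from every column
    S = Xc @ Xc.T                       # sum of outer products, shape (D, D)
    return S, m

# Illustrative dataset (made up): N = 4 vectors of dimension D = 2.
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0]])
S, m = scatter_matrix(X)
print(S)
print(m.ravel())
```

Note that the scatter matrix is (N - 1) times the sample covariance matrix that `np.cov(X)` returns for the same layout, so the two share their eigenvectors.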

The d largest principal components are the eigenvectors of S corresponding to the d largest eigenvalues. d can be chosen arbitrarily with d < D. Because S is symmetric and positive semidefinite, its eigenvectors can be found by singular value decomposition.
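One way to obtain the eigenvectors by singular value decomposition is to decompose the mean-centered data matrix rather than S itself. The sketch below assumes that route; the function name `principal_components`, the (D, N) data layout, and the random test data are illustrative choices, not part of the original text.

```python
import numpy as np

def principal_components(X, d):
    """Top-d eigenvalues and eigenvectors of the scatter matrix of X (shape (D, N))."""
    m = X.mean(axis=1, keepdims=True)
    Xc = X - m
    # Thin SVD of the centered data: Xc = U diag(s) Vt.
    # Then S = Xc Xc^T = U diag(s**2) U^T, so the columns of U are
    # eigenvectors of S and s**2 are the matching eigenvalues.
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return s[:d] ** 2, U[:, :d], m      # singular values come sorted, largest first

# Illustrative data (made up): 50 points in 3-D with unequal spread per axis.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 50)) * np.array([[5.0], [2.0], [0.5]])
eigvals, W, m = principal_components(X, d=2)

# Cross-check against a direct symmetric eigendecomposition of S.
Xc = X - m
S = Xc @ Xc.T
ref_vals = np.linalg.eigh(S)[0][::-1]   # eigh returns ascending order
```

Working on the centered data avoids forming S explicitly, which can help numerically when D is large; for small D, calling `np.linalg.eigh(S)` directly is equally reasonable.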

The dominant eigenvectors describe the main directions of variation of the data. For example, if a dataset has two large eigenvalues and the remaining ones are small, then the variation of the data is described largely by linear combinations of the two corresponding eigenvectors (i.e. the data is largely coplanar).
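The coplanar case can be illustrated with synthetic data. In the sketch below, the plane directions `u` and `v`, the normal `n`, the noise level, and the sample size are all made-up values chosen so that points lie close to a plane in 3-D; the eigenvalue spectrum of the scatter matrix should then show two large values and one small one.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500

# Two orthonormal in-plane directions and the plane normal (illustrative choices).
u = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)
v = np.array([0.0, 0.0, 1.0])
n = np.array([1.0, -1.0, 0.0]) / np.sqrt(2.0)

a = rng.normal(scale=3.0, size=N)        # large spread along u
b = rng.normal(scale=2.0, size=N)        # large spread along v
noise = rng.normal(scale=0.01, size=N)   # tiny spread along the normal
X = np.outer(u, a) + np.outer(v, b) + np.outer(n, noise)   # shape (3, N)

Xc = X - X.mean(axis=1, keepdims=True)
S = Xc @ Xc.T
vals, vecs = np.linalg.eigh(S)
vals, vecs = vals[::-1], vecs[:, ::-1]   # reorder to descending eigenvalues

print("eigenvalues:", vals)
print("smallest / second largest:", vals[2] / vals[1])
```

The eigenvector of the smallest eigenvalue, `vecs[:, 2]`, comes out close to the plane normal (up to sign), while the first two eigenvectors span the plane.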

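The d eigenvectors can also be used to project the data into a d-dimensional space. Define the $D \times d$ matrix

$$W = [e_1, e_2, \dots, e_d]$$

whose columns $e_k$ are the eigenvectors of S for the d largest eigenvalues. The projection of a vector $x$ is $y = W^\top (x - m)$, where $m$ is the mean of the dataset. The corresponding scatter matrix of the projected vectors $\{y_i\}$ is:

$$S_y = W^\top S W$$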
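The form of the projected scatter matrix follows in a few steps. The derivation below is a sketch under stated assumptions: the projected vectors are written $y_i = W^\top (x_i - m)$ with $m$ the dataset mean, the columns of W are orthonormal eigenvectors of S, and the symbols $\bar{y}$ (mean of the projected vectors) and $\lambda_k$ (eigenvalue of S belonging to the k-th column of W) are names introduced here.

```latex
\begin{align*}
\bar{y} &= \frac{1}{N}\sum_{i=1}^{N} W^\top (x_i - m)
         = W^\top\Big(\frac{1}{N}\sum_{i=1}^{N} x_i - m\Big) = 0, \\
S_y &= \sum_{i=1}^{N} (y_i - \bar{y})(y_i - \bar{y})^\top
     = \sum_{i=1}^{N} W^\top (x_i - m)(x_i - m)^\top W
     = W^\top S W, \\
W^\top S W &= \operatorname{diag}(\lambda_1, \dots, \lambda_d), \qquad
\det S_y = \prod_{k=1}^{d} \lambda_k .
\end{align*}
```

So the projected data is uncorrelated across its d coordinates, and the determinant of its scatter matrix is the product of the d largest eigenvalues of S.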

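Among all $D \times d$ matrices with orthonormal columns, the matrix W maximizes the determinant of this projected scatter matrix, $W^\top S W$, for a given d.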
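The projection and the determinant property can be checked numerically. The following is a minimal sketch on made-up random data; the variable names, the dimensions D = 4 and d = 2, and the random orthonormal basis `Q` used for comparison are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
D, N, d = 4, 200, 2

# Illustrative data (made up): unequal spread along each of the D axes.
X = rng.normal(size=(D, N)) * np.array([[4.0], [3.0], [1.0], [0.5]])

m = X.mean(axis=1, keepdims=True)
Xc = X - m
S = Xc @ Xc.T                          # scatter matrix of the original data

vals, vecs = np.linalg.eigh(S)         # ascending eigenvalues
order = np.argsort(vals)[::-1]         # indices of eigenvalues, largest first
W = vecs[:, order[:d]]                 # D x d matrix of dominant eigenvectors

Y = W.T @ (X - m)                      # projected data, shape (d, N)
Yc = Y - Y.mean(axis=1, keepdims=True)
S_y = Yc @ Yc.T                        # scatter matrix of the projected data

# Compare against some other orthonormal D x d basis, here a random one.
Q, _ = np.linalg.qr(rng.normal(size=(D, d)))
det_pca = np.linalg.det(S_y)
det_rand = np.linalg.det(Q.T @ S @ Q)
print(det_pca >= det_rand)
```

A single random basis is only a spot check, not a proof; the general statement follows from the interlacing of the eigenvalues of $Q^\top S Q$ with those of S.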

Bob Fisher
Friday June 15 17:50:17 BST 2001