Principal component analysis and dataset transformation

Bob Fisher

Principal component analysis (PCA) can be used to analyze the structure of a data set or to represent the data in a lower-dimensional space, among many other applications.

Let $\{\vec{x}_i\}$ be a set of N column vectors of dimension D. Define the scatter matrix $S$ of the data set as
\[ S = \sum_{i=1}^{N} (\vec{x}_i - \vec{m})(\vec{x}_i - \vec{m})^T \]
where $\vec{m}$ is the mean of the dataset
\[ \vec{m} = \frac{1}{N} \sum_{i=1}^{N} \vec{x}_i \]
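
As a minimal sketch of the definition above, the scatter matrix can be computed with NumPy by centering the data on its mean and forming the sum of outer products as one matrix product. The function name `scatter_matrix`, the (D, N) column layout, and the random test data are assumptions made for illustration only.

```python
import numpy as np

def scatter_matrix(X):
    """Return the scatter matrix and mean of X, where X has shape (D, N)
    and each column is one data vector (an assumed layout)."""
    m = X.mean(axis=1, keepdims=True)   # mean vector, shape (D, 1)
    Xc = X - m                          # subtract the mean from every column
    S = Xc @ Xc.T                       # sum of outer products, shape (D, D)
    return S, m

# Illustrative data: N = 50 random vectors of dimension D = 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 50))
S, m = scatter_matrix(X)
```

Note that the scatter matrix is not divided by N, so it equals the sample covariance matrix returned by `np.cov(X)` multiplied by N - 1.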

The d largest principal components are the eigenvectors $\vec{w}_i$ of S corresponding to the d largest eigenvalues. d can be chosen arbitrarily with d < D. The eigenvectors of S can usually be found by using singular value decomposition, applied either to S itself or to the matrix of mean-centered data vectors.
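
One possible way to obtain the components with singular value decomposition is sketched below. It assumes the data vectors are stored as columns of a (D, N) array and applies `np.linalg.svd` to the mean-centered data: the left singular vectors are then eigenvectors of the scatter matrix, and the squared singular values are its eigenvalues. The function name `principal_components` and the test data are hypothetical.

```python
import numpy as np

def principal_components(X, d):
    """Return the d leading eigenvectors and eigenvalues of the scatter
    matrix of X (shape (D, N), one data vector per column), plus the mean."""
    m = X.mean(axis=1, keepdims=True)
    # Thin SVD of the centered data: Xc = U diag(s) Vt, so Xc Xc^T = U diag(s^2) U^T.
    U, s, _ = np.linalg.svd(X - m, full_matrices=False)
    # NumPy returns singular values in descending order, so the first d
    # columns of U belong to the d largest eigenvalues.
    return U[:, :d], s[:d] ** 2, m

# Illustrative data with very different spread along each axis.
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 100)) * np.array([[5.0], [2.0], [1.0], [0.1]])
W, eigvals, m = principal_components(X, d=2)
```

Working from the centered data avoids forming S explicitly, which can be convenient when D is large; an eigendecomposition of S (for example with `np.linalg.eigh`) gives the same directions up to sign.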

The dominant eigenvectors describe the main directions of variation of the data. For example, if the scatter matrix of a dataset has 2 large eigenvalues and the remaining ones are small, then the variation of the data about its mean is described largely by linear combinations of the 2 corresponding eigenvectors (i.e. the data is largely coplanar).
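
The coplanar case can be illustrated with synthetic data. The sketch below builds 3-dimensional points that lie close to a plane (the plane, the noise level, and all variable names are assumptions chosen for the example) and measures how much of the total scatter the two largest eigenvalues account for.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two vectors spanning a plane in 3-D (an arbitrary choice for illustration).
basis = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
coeffs = rng.normal(size=(2, 200))                       # in-plane coordinates
X = basis @ coeffs + 0.01 * rng.normal(size=(3, 200))    # small off-plane noise

# Eigenvalues of the scatter matrix, sorted from largest to smallest.
Xc = X - X.mean(axis=1, keepdims=True)
eigvals = np.linalg.eigvalsh(Xc @ Xc.T)[::-1]

# Fraction of the total scatter captured by the two dominant directions.
fraction = eigvals[:2].sum() / eigvals.sum()
```

With data like this, `fraction` is very close to 1 and the third eigenvalue is tiny compared with the first two, which is the numerical signature of nearly coplanar data.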

The d eigenvectors can also be used to project the data into a d dimensional space. Define
\[ W = [\vec{w}_1, \vec{w}_2, \ldots, \vec{w}_d] \]
The projection of vector $\vec{x}$ is $\vec{y} = W^T (\vec{x} - \vec{m})$. The corresponding scatter matrix $S_y$ of the vectors $\{\vec{y}_i\}$ is:
\[ S_y = \sum_{i=1}^{N} \vec{y}_i \vec{y}_i^T = W^T S W \]
Among all D x d matrices with orthonormal columns, the matrix W maximizes the determinant of $S_y$ for a given d.
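
The projection step can be sketched as follows, assuming the projection subtracts the data mean before multiplying by the transposed eigenvector matrix, and that data vectors are stored as columns. The sketch checks that the scatter matrix of the projected vectors equals the original scatter matrix transformed by W, and compares its determinant with that obtained from a random orthonormal basis of the same size. The test data and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 300)) * np.array([[4.0], [3.0], [2.0], [1.0], [0.5]])
d = 2

# Scatter matrix of the original data.
m = X.mean(axis=1, keepdims=True)
Xc = X - m
S = Xc @ Xc.T

# eigh returns eigenvalues in ascending order, so pick the d largest explicitly.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1][:d]
W = eigvecs[:, order]            # columns are the d dominant eigenvectors

# Project every data vector and form the scatter matrix of the projections.
Y = W.T @ Xc                     # shape (d, N)
S_y = Y @ Y.T                    # shape (d, d)

# For comparison: a random D x d matrix with orthonormal columns.
Q, _ = np.linalg.qr(rng.normal(size=(5, d)))
det_pca = np.linalg.det(S_y)
det_rand = np.linalg.det(Q.T @ S @ Q)
```

Because the columns of W are eigenvectors of S, the projected scatter matrix comes out diagonal with the d largest eigenvalues on its diagonal, so its determinant is their product. A single random basis is only a spot check of the maximization property, not a proof.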



Friday June 15 17:50:17 BST 2001