Tag Archives: principal component analysis

Computing PCA with SVD

PCA is a great tool for performing dimensionality reduction. Two reason you might want to use SVD to compute PCA:

  • SVD is more numerically stable if the columns are close to collinear. I have seen this happen in text data, when certain terms almost always appear together.
  • Spark's PCA implementation currently doesn't support very wide matrices. The SVD implementation does however.

Singular Value Decomposition (SVD)

Below we briefly recap Singular Value Decomposition (SVD).

Let \mathbf{A} be a m \times n matrix, the singular value decomposition gives

\mathbf{A} = \mathbf{U}_{m\times m}\mathbf{\Sigma}_{m\times n}\mathbf{V}^\top_{n\times n}

\mathbf{U} is an orthonormal matrix and is the eigenvectors of \mathbf{A}\mathbf{A}^\top.

\mathbf{V} is an orthonormal matrix and is the eigenvectors of \mathbf{A}^\top\mathbf{A}.

\mathbf{\Sigma} is a diagonal matrix and contains the square-roots of the eigenvalues of\mathbf{A}\mathbf{A}^\top and  \mathbf{A}^\top\mathbf{A} e.g.

\mathbf{\Sigma}=\begin{pmatrix}\sqrt{\lambda_1}&0&\cdots& 0\\0 &\sqrt{\lambda_2} & \cdots &0\\\vdots & \vdots & \ddots & \vdots \\0&0& \cdots &\sqrt{\lambda_k}\end{pmatrix}

Remember, as \mathbf{U} is an orthonormal matrix \mathbf{U}^\top=\mathbf{U}^{-1}

Computing PCA

Start with the standard steps of PCA:

  1. Mean centre the matrix \mathbf{A}
  2. Optionally scale each column by their standard deviation. You may want to do this if the variables are measured on different scales.

We noted in the previous section that \mathbf{V} is the eigenvectors of \mathbf{A}^\top\mathbf{A} (the covariance matrix). Thus the principal component decomposition is

\begin{equation} \begin{split}\mathbf{T} & = \mathbf{A}\mathbf{V}\\ & =\mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top\mathbf{V} \\ & =\mathbf{U}\mathbf{\Sigma}\end{split}\end{equation}

To reduce the dimensionality of \mathbf{A}, select the k largest singular values (k \le n), select the first k columns from \mathbf{U} and the k \times k upper-left part of \mathbf{\Sigma}. The reduced dimensionality is given by

\mathbf{T}_k =\mathbf{U}_k\mathbf{\Sigma}_k