PCA is a great tool for performing dimensionality reduction. Two reasons you might want to use SVD to compute PCA:

- SVD is more numerically stable if the columns are close to collinear. I have seen this happen in text data, when certain terms almost always appear together.
- Spark's PCA implementation currently doesn't support very wide matrices. The SVD implementation, however, does.

# Singular Value Decomposition (SVD)

Below we briefly recap Singular Value Decomposition (SVD).

Let $A$ be an $m \times n$ matrix; the singular value decomposition gives

$$A = U \Sigma V^T$$

$U$ is an $m \times m$ orthonormal matrix and its columns are the eigenvectors of $A A^T$.

$V$ is an $n \times n$ orthonormal matrix and its columns are the eigenvectors of $A^T A$.

$\Sigma$ is an $m \times n$ diagonal matrix and contains the square-roots of the eigenvalues of $A A^T$ and $A^T A$, e.g.

$$\Sigma = \begin{pmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_n \end{pmatrix}, \quad \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge 0$$

Remember, as $V$ is an orthonormal matrix

$$V^T V = V V^T = I$$
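To make the recap concrete, here is a small NumPy sketch (the matrix values are arbitrary, chosen only for illustration) that checks these properties of the decomposition:

```python
import numpy as np

# An arbitrary small example matrix.
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0],
              [2.0, 2.0, 0.0]])

# Thin SVD: A = U @ diag(s) @ Vt, with singular values s in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The decomposition reconstructs A.
assert np.allclose(U @ np.diag(s) @ Vt, A)

# V is orthonormal: V^T V = I.
assert np.allclose(Vt @ Vt.T, np.eye(3))

# The singular values are the square-roots of the eigenvalues of A^T A.
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]  # descending
assert np.allclose(np.sqrt(eigvals), s)
```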

# Computing PCA

Start with the standard steps of PCA:

- Mean centre the matrix
- Optionally, scale each column by its standard deviation. You may want to do this if the variables are measured on different scales.
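These preparation steps can be sketched in NumPy as follows (the data here is synthetic, generated only to illustrate columns on very different scales):

```python
import numpy as np

# Synthetic data: three columns on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 100.0])

# Mean-centre each column.
A = X - X.mean(axis=0)

# Optionally scale each column by its standard deviation.
A_scaled = A / X.std(axis=0)

assert np.allclose(A.mean(axis=0), 0.0)       # columns now have zero mean
assert np.allclose(A_scaled.std(axis=0), 1.0) # and, if scaled, unit variance
```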

We noted in the previous section that the columns of $V$ are the eigenvectors of $A^T A$ (which, for mean-centred $A$, is proportional to the covariance matrix). Thus the principal component decomposition is

$$T = A V = U \Sigma V^T V = U \Sigma$$
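A quick NumPy check of this identity on synthetic mean-centred data (multiplying $U$ column-wise by the singular values plays the role of $U \Sigma$):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
A = X - X.mean(axis=0)  # mean-centred data

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Principal component scores: T = A V = U Sigma.
T = A @ Vt.T
assert np.allclose(T, U * s)  # U * s scales column i of U by s[i]
```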

To reduce the dimensionality of $A$ to $k$ dimensions, select the $k$ largest singular values ($\sigma_1, \ldots, \sigma_k$), the first $k$ columns of $U$, and the upper-left $k \times k$ block of $\Sigma$. The reduced-dimensionality representation is given by

$$T_k = U_k \Sigma_k$$
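As a final sketch, here is the truncation step in NumPy (again on synthetic data; $k$ is chosen arbitrarily). Because `np.linalg.svd` returns the singular values in descending order, keeping the top $k$ is just slicing:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 10))
A = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3  # keep the top-k components (arbitrary choice for illustration)
T_k = U[:, :k] * s[:k]  # T_k = U_k Sigma_k

assert T_k.shape == (50, 3)
# Equivalent to projecting A onto the first k right singular vectors.
assert np.allclose(T_k, A @ Vt[:k].T)
```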