
Review on PCA

Principal Component Analysis (PCA) is a traditional unsupervised dimensionality-reduction technique that is often used to transform a high-dimensional dataset into a lower-dimensional subspace. PCA seeks linear combinations of the variables such that the maximum variance is extracted from the variables. In this blog, I will review this technique, one of the most widely used in applied mathematics.

Idea

To discuss PCA, let's introduce some notation to describe the sample data matrix:

$$ \boldsymbol{X}=\left[ \begin{array}{lll} x_{1,1} & \cdots & x_{1,\textrm{Nfeat}}\\ x_{2,1} & \cdots & x_{2,\textrm{Nfeat}}\\ \vdots & & \vdots\\ x_{\textrm{Nsample},1} & \cdots & x_{\textrm{Nsample},\textrm{Nfeat}} \end{array} \right] \in R^{\textrm{Nsample} \times \textrm{Nfeat}} $$

Let $\boldsymbol{x}_{i}$ also denote the length-Nfeat feature vector of the $i$th sample, i.e., the $i$th row of $\boldsymbol{X}$.

With PCA, we can find a set of weights $\boldsymbol{w}=[w_1,\cdots,w_{\textrm{Nfeat}}]^T \in R^{\textrm{Nfeat}}$ that creates a "good" feature as a linear function of the original features:

$$ f_i = \boldsymbol{x}_{i}^T \boldsymbol{w} = \sum_{f=1}^{\textrm{Nfeat}} x_{i,f} w_{f} $$
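As a concrete illustration, here is a minimal numpy sketch of this linear combination; the data values, the weight values, and the variable names are all made up for this example:

```python
import numpy as np

# Toy data matrix: Nsample = 5 samples, Nfeat = 3 features (values are arbitrary).
X = np.array([[1.0, 2.0, 0.5],
              [0.9, 1.8, 0.7],
              [1.2, 2.5, 0.4],
              [0.8, 1.7, 0.6],
              [1.1, 2.2, 0.5]])

# A weight vector w (chosen arbitrarily here), normalized to unit length.
w = np.array([0.6, 0.8, 0.0])
w = w / np.linalg.norm(w)          # enforce ||w||_2 = 1

# f_i = x_i^T w for every sample, i.e. one new feature value per sample.
f = X @ w
print(f.shape)                     # (5,)
```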

The question is, what should be considered a "good" feature? In PCA, a "good" feature is one that captures a large amount of the variability in the original data. So mathematically, we can find the weights $\boldsymbol{w}$ that create a "good" feature as:

$$ \underset{\boldsymbol{w}, ||\boldsymbol{w} ||_2^2 = 1}{max} Var(f_i) $$

This optimization is constrained to $||\boldsymbol{w} ||_2^2 = \sum_f w_f^2 = 1$ because otherwise you could make the magnitude of $\boldsymbol{w}$ arbitrarily large and the optimization problem would have no solution. So PCA involves a constrained optimization problem for these weights $\boldsymbol{w}$! Interestingly, the solution to this constrained optimization exists in closed form: it is the eigenvector of the sample variance matrix corresponding to its largest eigenvalue. In the next section, I will discuss in more detail how to obtain this result.
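As a quick sanity check of this claim (not part of the derivation), the sketch below assumes numpy and scikit-learn are available and uses made-up random data; it compares the eigenvector of the sample covariance with the largest eigenvalue against the first component returned by sklearn.decomposition.PCA, which agree up to sign:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))              # toy data: Nsample=200, Nfeat=4

# Eigendecomposition of the sample covariance matrix.
S = np.cov(X, rowvar=False)                # Nfeat x Nfeat covariance
eigvals, eigvecs = np.linalg.eigh(S)       # eigh: for symmetric matrices
w = eigvecs[:, np.argmax(eigvals)]         # eigenvector of the largest eigenvalue

# First principal component from scikit-learn.
p1 = PCA(n_components=1).fit(X).components_[0]

# Eigenvectors are only defined up to sign, so compare absolute values.
print(np.allclose(np.abs(w), np.abs(p1)))  # True
```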

The eigenvector with the largest eigenvalue as the first principal component

In the first section, we learned that PCA is a constrained optimization problem. Let's expand its objective function:

\begin{array}{rl} Var(f_i) &= Var( \boldsymbol{x}_{i}^T \boldsymbol{w} ) \\ &= \boldsymbol{w}^T Var( \boldsymbol{x}_{i} ) \boldsymbol{w}\\ \end{array} where $Var( \boldsymbol{x}_{i} ) \in R^{\textrm{Nfeat} \times \textrm{Nfeat}}$, and by definition, $Var( \boldsymbol{x}_{i} ) = E( \boldsymbol{x}_{i}\boldsymbol{x}_{i}^T ) - E( \boldsymbol{x}_{i})E(\boldsymbol{x}_{i} )^T$. Assume that each column of $\boldsymbol{X}$ is standardized to have mean 0; then $E(\boldsymbol{x}_i)=\boldsymbol{0}$, so $$ Var( \boldsymbol{x}_{i} ) = E( \boldsymbol{x}_{i}\boldsymbol{x}_{i}^T ) $$

The unbiased sample estimate of this variance matrix $Var(\boldsymbol{x}_{i})$ is proportional to the cross product $\boldsymbol{X}^T \boldsymbol{X} \in R^{\textrm{Nfeat} \times \textrm{Nfeat}}$; specifically, it equals $\boldsymbol{X}^T \boldsymbol{X}/(\textrm{Nsample}-1)$ when the columns are centered.

This sample variance is a symmetric matrix: the transpose of the matrix is identical to the matrix itself. This is easily seen from $(\boldsymbol{X}^T \boldsymbol{X})^T=\boldsymbol{X}^T \boldsymbol{X}.$
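A minimal numpy sketch of this step, on made-up data: center the columns of $\boldsymbol{X}$, form the cross product, and check that it is symmetric and proportional to the unbiased sample covariance:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                    # toy data: Nsample=100, Nfeat=3
Xc = X - X.mean(axis=0)                          # center each column (feature) to mean 0

S = Xc.T @ Xc                                    # cross product, Nfeat x Nfeat
print(np.allclose(S, S.T))                       # True: symmetric
print(np.allclose(S / (len(X) - 1),              # proportional to the unbiased
                  np.cov(X, rowvar=False)))      # sample covariance: True
```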

Principal Axes Theorem

In the previous section, we learned that the objective function of the constrained optimization problem is expressed with $ \boldsymbol{X}^T\boldsymbol{X}$. To further expand the sample variance $ \boldsymbol{X}^T\boldsymbol{X}$, I give you the Principal Axes Theorem, which takes advantage of the symmetry of the sample variance:

Let $A$ be an $n \times n$ symmetric matrix. Then there is an orthogonal change of variable, $\boldsymbol{a}=P\boldsymbol{b}$, that transforms the quadratic form $\boldsymbol{a}^T A \boldsymbol{a}$ into a quadratic form $\boldsymbol{b}^T D \boldsymbol{b}$ with no cross-product terms, where $D$ is a diagonal matrix with the eigenvalues of $A$ on its diagonal, ordered in decreasing order as $\lambda_1 \ge \lambda_2 \ge \lambda_3 \ge \cdots \ge \lambda_{\textrm{Nfeat}}\ge 0$ (the eigenvalues are nonnegative in our case because the sample variance is positive semidefinite): $$ \boldsymbol{D}= \left[ \begin{array}{lllll} \lambda_{1} & 0 & 0 &\cdots & 0\\ 0 & \lambda_{2} & 0 &\cdots & 0\\ 0 & 0 &\lambda_{3} &\cdots & 0\\ \vdots& & &\ddots & \vdots\\ 0 & 0 & 0 &\cdots & \lambda_{\textrm{Nfeat}} \end{array} \right] \in R^{\textrm{Nfeat} \times \textrm{Nfeat}} \; $$ and the columns $\boldsymbol{p}_f$ of $\boldsymbol{P}$ are the corresponding unit eigenvectors ($||\boldsymbol{p}_{f}||=1$): $$ \boldsymbol{P} = \left[ \begin{array}{rlll} \boldsymbol{p}_{1} & \cdots & \boldsymbol{p}_{\textrm{Nfeat}} \end{array} \right] \in R^{\textrm{Nfeat} \times \textrm{Nfeat}} $$

Some observations about the theorem:

  • Observation 1: "$\boldsymbol{a}=P\boldsymbol{b}$" is an orthogonal change of variable, meaning that $\boldsymbol{p}_c^T \boldsymbol{p}_k = 0$ if $k\ne c$ and $\boldsymbol{p}_k^T \boldsymbol{p}_k = 1$. This orthonormal matrix $\boldsymbol{P}$ comes with the nice property that the inverse of $\boldsymbol{P}$ is the same as its transpose: $\boldsymbol{P}^{-1} = \boldsymbol{P}^T$. To see this:

$$ \boldsymbol{P}^T\boldsymbol{P}= \left[ \begin{array}{llll} p_1^T p_1 & p_1^T p_2 & \cdots & p_1^T p_{\textrm{Nfeat}}\\ p_2^T p_1 & p_2^T p_2 & \cdots & p_2^T p_{\textrm{Nfeat}}\\ \vdots & & &\vdots\\ p_{\textrm{Nfeat}}^T p_1 & p_{\textrm{Nfeat}}^T p_2 & \cdots & p_{\textrm{Nfeat}}^T p_{\textrm{Nfeat}} \end{array} \right] =\boldsymbol{I} $$

  • Observation 2: $\boldsymbol{p}_k$ is an eigenvector of $A$ with corresponding eigenvalue $\lambda_k$. This means that:

$$ \lambda_k \boldsymbol{p}_k = A\boldsymbol{p}_k $$

Therefore:

\begin{array}{rl} \boldsymbol{P}\boldsymbol{D} &= A \boldsymbol{P}\\ \boldsymbol{P}\boldsymbol{D} \boldsymbol{P}^{-1} &= A\\ \boldsymbol{P}\boldsymbol{D} \boldsymbol{P}^{T} &= A \end{array} The last line is true by Observation 1.

With the Principal Axes Theorem in hand, and letting $A=\boldsymbol{X}^T\boldsymbol{X}$, we have:

$$ \boldsymbol{X}^T\boldsymbol{X} = \boldsymbol{P}\boldsymbol{D} \boldsymbol{P}^{T} $$
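The sketch below verifies these facts numerically on made-up data, using numpy's eigh for the eigendecomposition of a symmetric matrix: each column of $\boldsymbol{P}$ is an eigenvector of $A$, $\boldsymbol{P}$ is orthonormal, and $\boldsymbol{P}\boldsymbol{D}\boldsymbol{P}^{T}$ reconstructs $A=\boldsymbol{X}^T\boldsymbol{X}$:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)                      # center the columns
A = X.T @ X                                 # symmetric sample variance (up to a constant)

eigvals, P = np.linalg.eigh(A)              # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]           # reorder so that lambda_1 >= lambda_2 >= ...
eigvals, P = eigvals[order], P[:, order]
D = np.diag(eigvals)

print(np.allclose(A @ P[:, 0], eigvals[0] * P[:, 0]))  # A p_1 = lambda_1 p_1
print(np.allclose(P.T @ P, np.eye(4)))                 # P is orthonormal
print(np.allclose(P @ D @ P.T, A))                     # A = P D P^T
```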

Back to the optimization problem

Now, let's go back to PCA's constrained optimization problem: $$ \underset{\boldsymbol{w}, ||\boldsymbol{w} ||_2 = 1}{max} Var(f_i) $$ In the sample setting, this problem becomes: $$ \underset{\boldsymbol{w}, ||\boldsymbol{w} ||_2 = 1}{max} \boldsymbol{w}^T \boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{w} $$ The Principal Axes Theorem turns this problem into: $$ \underset{\boldsymbol{w}, ||\boldsymbol{w} ||_2 = 1}{max} \boldsymbol{w}^T \boldsymbol{P}\boldsymbol{D}\boldsymbol{P}^{T} \boldsymbol{w} $$

Let $\boldsymbol{y} = \boldsymbol{P}^T \boldsymbol{w}$, with $\boldsymbol{y} =[y_1,\cdots,y_{\textrm{Nfeat}}]^T$. Notice that its norm is 1: \begin{array}{rl} ||\boldsymbol{y} ||_2^2 &= ||\boldsymbol{P}^T \boldsymbol{w}||_2^2 \\ &= \boldsymbol{w}^T\boldsymbol{P} \boldsymbol{P}^T \boldsymbol{w}\\ &= \boldsymbol{w}^T\boldsymbol{P} \boldsymbol{P}^{-1} \boldsymbol{w}\\ &= \boldsymbol{w}^T\boldsymbol{I} \boldsymbol{w}\\ &= \boldsymbol{w}^T\boldsymbol{w}\\ &= 1 \end{array} Then, let's consider the upper bound of the objective function.

\begin{array}{ll} \boldsymbol{w}^T \boldsymbol{P}\boldsymbol{D}\boldsymbol{P}^{T} \boldsymbol{w} &= \boldsymbol{y}^T \boldsymbol{D} \boldsymbol{y}\\ &= \sum_{f} \lambda_f y^2_{f}\\ &\le \sum_{f} \lambda_1 y^2_{f}\\ &= \lambda_1 \sum_{f} y^2_{f}\\ &= \lambda_1 || \boldsymbol{y}||_2^2=\lambda_1\\ \end{array}

Finally, notice that $\boldsymbol{w}^T \boldsymbol{P}\boldsymbol{D}\boldsymbol{P}^{T} \boldsymbol{w} = \lambda_1$ when $\boldsymbol{w}=\boldsymbol{p}_1$, the first principal component, because $\boldsymbol{P}^T\boldsymbol{p}_1 = [1,0,\cdots,0]^T$:

$$ \boldsymbol{p}_1^T \boldsymbol{P}\boldsymbol{D}\boldsymbol{P}^{T} \boldsymbol{p}_1 =[1,0,\cdots,0] \boldsymbol{D} \left[ \begin{array}{r} 1\\ 0\\ \vdots\\ 0 \end{array} \right] =\lambda_1 $$

We just showed that the weight vector that creates the "best" linear combination of the original features is the eigenvector of the sample variance matrix corresponding to its largest eigenvalue.
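To close the loop, here is a small numerical sketch on made-up data checking that the top eigenvector attains the bound $\lambda_1$, while random unit-norm weight vectors never exceed it:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)                     # center the columns
A = X.T @ X

eigvals, P = np.linalg.eigh(A)
lam1, p1 = eigvals[-1], P[:, -1]           # largest eigenvalue and its eigenvector

print(np.isclose(p1 @ A @ p1, lam1))       # True: the top eigenvector attains lambda_1

# No random unit-norm weight vector does better than lambda_1.
W = rng.normal(size=(1000, 5))
W /= np.linalg.norm(W, axis=1, keepdims=True)
print(np.all(np.einsum('ij,jk,ik->i', W, A, W) <= lam1 + 1e-9))  # True
```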

Reference

A Tutorial on Principal Component Analysis
