PCA

Posted by AH on April 19, 2022


Data Representation

  • Dataset: $\{\mathbf{x}_i\}_{i=1}^n$

    • Each sample has $d$ features: $\mathbf{x}_i = [x_{i1}, \dots, x_{id}]^T \in \mathbb{R}^d$

    • The dataset can be written as the matrix $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_n]^T \in \mathbb{R}^{n\times d}$

    • We assume the data are centered, i.e., the mean is 0

  • Note: there are no labels here, so this is unsupervised learning

Main Ideas

  • Idea 1: linearly approximate each feature vector $\mathbf{x}_i \in \mathbb{R}^d$ with a low-dimensional vector $\mathbf{z}_i \in \mathbb{R}^R$

    \[\mathbf{x}_i \approx \mathbf{B} \mathbf{z}_i, \quad i=1, ..., n\]

    where:

    • the columns of $\mathbf{B} \in \mathbb{R}^{d\times R}$ form a dictionary of $R$ elements

    • $R$ is the rank of the approximation, with $1\leq R < d$ and ideally $R \ll d$

  • Idea 2: the approximation minimizes the residual sum of squares (RSS)

    \[(\hat{\mathbf{B}}, \{\hat{\mathbf{z}}_i\}_{i=1}^n) = \arg \min_{\mathbf{B}, \{\mathbf{z}_i\}} \{\sum_{i=1}^n \lVert \mathbf{x}_i - \mathbf{Bz}_i \rVert^2 \}\]

    Note: RSS is used here only for simplicity; it may not give the best results

Solution

  • The optimal $R$-element approximation dictionary is
\[\hat{\mathbf{B}} = \mathbf{V}_R \triangleq [\mathbf{v}_1, ..., \mathbf{v}_R]\]

where $\{\mathbf{v}_r\}_{r=1}^R$ are the eigenvectors of the sample covariance matrix $\mathbf{Q}$ corresponding to its $R$ largest eigenvalues $\{\lambda_r\}_{r=1}^R$:

\[\mathbf{Q} \triangleq \frac{1}{n} \sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^T = \sum_{j=1}^d \lambda_j \mathbf{v}_j \mathbf{v}_j^T\]
  • The optimal coefficients are
\[\hat{\mathbf{z}}_i = \mathbf{V}_R^T \mathbf{x}_i\]
  • PCA projects $\mathbf{x}_i \in \mathbb{R}^d$ onto the subspace spanned by $\{\mathbf{v}_r\}_{r=1}^R$; these vectors are also called the $R$ principal components
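
As a concrete reference, here is a minimal NumPy sketch of this recipe; the synthetic data `X` and the names `Q`, `V_R`, `Z` are illustrative choices, not part of the original notes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, R = 200, 5, 2

# Centered data matrix X (n x d), one sample x_i^T per row.
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))
X -= X.mean(axis=0)

# Sample covariance Q = (1/n) * sum_i x_i x_i^T = (1/n) X^T X.
Q = X.T @ X / n

# Eigen-decomposition; np.linalg.eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(Q)
order = np.argsort(eigvals)[::-1]          # sort descending
lam, V = eigvals[order], eigvecs[:, order]

V_R = V[:, :R]                              # dictionary B-hat = V_R (d x R)
Z = X @ V_R                                 # coefficients z_i = V_R^T x_i, one row each
X_hat = Z @ V_R.T                           # rank-R approximations B z_i
```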

Orthogonal Projection

Denote the subspace spanned by the columns of $\mathbf{B}$ (i.e., all linear combinations of its columns) by $\mbox{colsp}(\mathbf{B}) \triangleq \{ \mathbf{Bz} \quad s.t.~\mathbf{z} \in \mathbb{R}^R \} \subset \mathbb{R}^d$

Projecting a vector $\mathbf{x}\in\mathbb{R}^d$ onto $\mbox{colsp}(\mathbf{B})$ decomposes it as

\[\mathbf{x} = \hat{\mathbf{x}} + \mathbf{e}\]

where $\hat{\mathbf{x}} \in \mbox{colsp}(\mathbf{B})$ and $\mathbf{e} \perp \mbox{colsp}(\mathbf{B})$

Both components can be computed as

\[\hat{\mathbf{x}} = \mathbf{P_B x},~ \mathbf{e} = \mathbf{P_B^\perp x}\]

where $\mathbf{P}_\mathbf{B} = \mathbf{B}(\mathbf{B}^T\mathbf{B})^{-1} \mathbf{B}^T$ and $\mathbf{P}_\mathbf{B}^\perp = \mathbf{I} - \mathbf{P}_\mathbf{B}$

$\mathbf{P}_\mathbf{B}$ and $\mathbf{P}_\mathbf{B}^\perp$ are called projection operators; they satisfy $\mathbf{P}_\mathbf{B} = \mathbf{P}_\mathbf{B}^T$ and $\mathbf{P}_\mathbf{B} = \mathbf{P}_\mathbf{B}^2$
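
A quick NumPy check of these identities, using an arbitrary full-column-rank `B` chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d, R = 6, 3
B = rng.normal(size=(d, R))                 # any full-column-rank B

# Projection onto colsp(B) and onto its orthogonal complement.
P_B = B @ np.linalg.inv(B.T @ B) @ B.T
P_perp = np.eye(d) - P_B

x = rng.normal(size=d)
x_hat, e = P_B @ x, P_perp @ x

assert np.allclose(x, x_hat + e)            # x = x_hat + e
assert np.allclose(B.T @ e, 0)              # e is orthogonal to colsp(B)
assert np.allclose(P_B, P_B.T)              # P_B = P_B^T
assert np.allclose(P_B, P_B @ P_B)          # P_B = P_B^2
```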

Derivation

  1. First fix $\mathbf{B}$ and solve for $\hat{\mathbf{z}}_i$ (ordinary least squares)
\[\hat{\mathbf{z}}_i = \arg \min_{\mathbf{z}_i} \lVert \mathbf{x}_i - \mathbf{Bz}_i \rVert^2 = (\mathbf{B}^T\mathbf{B})^{-1} \mathbf{B}^T \mathbf{x}_i\]
  2. Substitute $\{\hat{\mathbf{z}}_i\}_{i=1}^n$ back into the original problem to obtain
\[\begin{aligned} \hat{\mathbf{B}} &= \arg \min_{\mathbf{B}} \{\sum_{i=1}^n \lVert \mathbf{x}_i - \mathbf{B}(\mathbf{B}^T\mathbf{B})^{-1} \mathbf{B}^T \mathbf{x}_i \rVert^2\} \\ &= \arg \min_{\mathbf{B}} \{\sum_{i=1}^n \lVert (\mathbf{I} - \mathbf{B}(\mathbf{B}^T\mathbf{B})^{-1} \mathbf{B}^T) \mathbf{x}_i \rVert^2\} \\ &= \arg \min_{\mathbf{B}} \sum_{i=1}^n \{\lVert \mathbf{P_B^\perp} \mathbf{x}_i \rVert^2\} \\ &= \arg \min_{\mathbf{B}} \sum_{i=1}^n \{\lVert \mathbf{e}_i \rVert^2\} \end{aligned}\]

That is, PCA chooses the $\mathbf{B}$ that minimizes the sum of squared projection errors.

Going further, we can define the cost above as

\[\begin{aligned} J(\mathbf{B}) &\triangleq \sum_i \{\lVert \mathbf{P_B^\perp} \mathbf{x}_i \rVert^2\} \\ &= \sum_i \mathbf{x}_i^T (\mathbf{P_B^\perp})^T \mathbf{P_B^\perp} \mathbf{x}_i \\ &= \sum_i \mathbf{x}_i^T \mathbf{P_B^\perp} \mathbf{x}_i \\ &= \sum_i \mbox{tr}(\mathbf{x}_i^T \mathbf{P_B^\perp} \mathbf{x}_i) \\ &= \sum_i \mbox{tr}(\mathbf{P_B^\perp} \mathbf{x}_i \mathbf{x}_i^T) \\ &= \mbox{tr}(\mathbf{P_B^\perp} \sum_i \mathbf{x}_i \mathbf{x}_i^T) \\ &= n \mbox{tr}(\mathbf{P_B^\perp}~\frac{1}{n} \sum_i \mathbf{x}_i \mathbf{x}_i^T) \\ &= n \mbox{tr}(\mathbf{P_B^\perp}~\mathbf{Q}) \end{aligned}\]

It can be shown that $\mathbf{Q}$ is positive semidefinite, so it admits an eigen-decomposition:

\[\mathbf{Q} = \mathbf{V\Lambda V^T}\]

where:

  • $\mathbf{V}$ is an orthogonal matrix, i.e., $\mathbf{VV}^T = \mathbf{I}_d = \mathbf{V}^T \mathbf{V}$
  • $\Lambda = \mbox{Diag}(\lambda_1, \dots, \lambda_d)$ with $\lambda_j \geq 0~\forall j$, i.e., $\Lambda$ is the diagonal matrix of eigenvalues (all nonnegative because $\mathbf{Q}$ is positive semidefinite)

Furthermore, we may assume without loss of generality that the $\{\lambda_j\}$ are sorted in decreasing order.
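
As a sanity check on the trace identity derived above, here is a small NumPy sketch with arbitrary synthetic `X` and `B`:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, R = 100, 5, 2
X = rng.normal(size=(n, d))
X -= X.mean(axis=0)
B = rng.normal(size=(d, R))

Q = X.T @ X / n
P_perp = np.eye(d) - B @ np.linalg.inv(B.T @ B) @ B.T

# Direct sum of squared projection errors vs. the trace form n * tr(P_B^perp Q).
J_direct = np.sum((X @ P_perp.T) ** 2)      # sum_i ||P_B^perp x_i||^2
J_trace = n * np.trace(P_perp @ Q)
assert np.allclose(J_direct, J_trace)
```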

Theorem (Eckart-Young, 1936)

The optimal $\mathbf{B}\in \mathbb{R}^{d\times R}$ consists of the $R$ principal eigenvectors of $\mathbf{Q}$:

\[\hat{\mathbf{B}} = \arg \min_\mathbf{B} n \mbox{tr}(\mathbf{P_B^\perp}\mathbf{Q}) = [\mathbf{v}_1, ..., \mathbf{v}_R] \triangleq \mathbf{V}_R\]

More precisely, the optimal $\hat{\mathbf{B}}$ is any $\mathbf{B}$ satisfying $\mbox{colsp}(\mathbf{B}) = \mbox{colsp}(\mathbf{V}_R)$.
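
A numerical illustration of the theorem (synthetic data; the random competitor `B_rand` is only for comparison):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, R = 100, 5, 2
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))
X -= X.mean(axis=0)
Q = X.T @ X / n

def cost(B):
    """n * tr(P_B^perp Q), the summed squared projection error."""
    P_B = B @ np.linalg.inv(B.T @ B) @ B.T
    return n * np.trace((np.eye(d) - P_B) @ Q)

eigvals, V = np.linalg.eigh(Q)
V_R = V[:, np.argsort(eigvals)[::-1][:R]]   # top-R eigenvectors of Q

B_rand = rng.normal(size=(d, R))            # arbitrary competitor of the same size
assert cost(V_R) <= cost(B_rand) + 1e-9     # V_R achieves the minimum cost
```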

Performance of PCA

Given the rank $R$, how should we evaluate the performance of PCA? (This also helps us choose a suitable value of $R$.)

Recall: in linear regression we have $R^2 \triangleq 1 - \dfrac{\mbox{RSS}}{ns_y^2}$

For PCA, we instead use the proportion of variance (PoV):

\[\mbox{PoV} \triangleq 1 - \frac{\mbox{RSS}}{\sum_{i=1}^n \lVert \mathbf{x}_i \rVert^2}\]

where:

  • $\sum_{i=1}^n \lVert \mathbf{x}_i \rVert^2 = \sum_{i=1}^n \mbox{tr}(\mathbf{x}_i \mathbf{x}_i^T) = n\mbox{tr}(\mathbf{Q}) = n\mbox{tr}(\mathbf{V\Lambda V^T}) = n\sum_{j=1}^d \lambda_j$
  • $\mbox{RSS} = n\mbox{tr}(\mathbf{Q}) - n\mbox{tr}(\mathbf{P}_{\mathbf{V}_R} \mathbf{Q}) = n\sum_{j=1}^d \lambda_j - n\sum_{j=1}^R \lambda_j = n\sum_{j={R+1}}^d \lambda_j$

Therefore, the $\mbox{PoV}$ of the first $R$ principal components is

\[\mbox{PoV}(R) = \frac{n \sum_{j=1}^d \lambda_j}{n \sum_{j=1}^d \lambda_j} - \frac{n \sum_{j=R+1}^d \lambda_j}{n \sum_{j=1}^d \lambda_j} = \frac{\sum_{j=1}^R \lambda_j}{\sum_{j=1}^d \lambda_j}\]

The closer $\mbox{PoV}$ is to 1, the better.
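
In code, $\mbox{PoV}(R)$ can be read directly off the eigenvalue spectrum; a minimal sketch with illustrative eigenvalues:

```python
import numpy as np

# Eigenvalues of Q, assumed sorted in decreasing order (illustrative values).
lam = np.array([4.0, 2.5, 1.0, 0.3, 0.2])

pov = np.cumsum(lam) / lam.sum()            # PoV(R) for R = 1, ..., d
print(pov)                                  # [0.5, 0.8125, 0.9375, 0.975, 1.0]

# For example, pick the smallest R with PoV(R) >= 0.9.
R = int(np.searchsorted(pov, 0.9) + 1)      # R = 3 here
```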

Computing PCA via the SVD

SVD - Singular Value Decomposition

Given any matrix $\mathbf{X}\in \mathbb{R}^{n\times d}$, let $r \triangleq \mbox{rank}(\mathbf{X})$

  • Standard SVD:
\[\mathbf{X} = \mathbf{USV}^T, \mbox{where}~\Bigg\{ \begin{aligned} &\mathbf{U} \in \mathbb{R}^{n\times n}~\mbox{obeys}~\mathbf{UU}^T = \mathbf{I}_n = \mathbf{U}^T\mathbf{U} \\ &\mathbf{V} \in \mathbb{R}^{d\times d}~\mbox{obeys}~\mathbf{VV}^T = \mathbf{I}_d = \mathbf{V}^T\mathbf{V} \\ &\mathbf{S} \in \mathbb{R}^{n\times d}~\mbox{obeys}~\mathbf{S} = \mbox{Diag}(s_1, ..., s_m)~\mbox{where}~m=\min\{n, d\}~\mbox{and}~s_1 \geq s_2 \geq ... \geq 0 \\ \end{aligned}\]
  • Economy SVD:
\[\mathbf{X} = \mathbf{U}_r\mathbf{S}_r\mathbf{V}_r^T, \mbox{where}~\Bigg\{ \begin{aligned} &\mathbf{U}_r \in \mathbb{R}^{n\times r}~\mbox{obeys}~\mathbf{U}_r^T\mathbf{U}_r = \mathbf{I}_r \\ &\mathbf{V}_r \in \mathbb{R}^{d\times r}~\mbox{obeys}~\mathbf{V}_r^T\mathbf{V}_r = \mathbf{I}_r \\ &\mathbf{S}_r \in \mathbb{R}^{r\times r}~\mbox{obeys}~\mathbf{S}_r = \mbox{Diag}(s_1, ..., s_r)~\mbox{where}~r\leq\min\{n, d\}~\mbox{and}~s_1 \geq s_2 \geq ... \geq s_r > 0 \\ \end{aligned}\]
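
In NumPy, for example, `np.linalg.svd` exposes both forms via the `full_matrices` flag; note that its reduced form keeps $m=\min\{n,d\}$ components rather than truncating at $r=\mbox{rank}(\mathbf{X})$ (an implementation detail, not part of the definition above):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 8, 5
X = rng.normal(size=(n, d))

# Standard SVD: U is n x n, Vt is d x d, s holds min(n, d) singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=True)
print(U.shape, s.shape, Vt.shape)           # (8, 8) (5,) (5, 5)

# Reduced ("thin") SVD: U is n x m with m = min(n, d).  Truncating further at
# r = rank(X) would additionally drop the zero singular values.
U_m, s_m, Vt_m = np.linalg.svd(X, full_matrices=False)
print(U_m.shape, s_m.shape, Vt_m.shape)     # (8, 5) (5,) (5, 5)
```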

Computing PCA via the Standard SVD

As before, assume $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_n]^T \in \mathbb{R}^{n\times d}$ is centered (i.e., mean = 0)

For PCA, the eigen-decomposition of the covariance matrix $\mathbf{Q}$ (with eigenvalues sorted in decreasing order) is

\(\mathbf{Q} = \frac{1}{n} \mathbf{X}^T\mathbf{X} = \mathbf{V\Lambda V^T}\), where:

  • $\mathbf{V}$ is an orthogonal matrix, i.e., $\mathbf{VV}^T = \mathbf{I}_d = \mathbf{V}^T \mathbf{V}$

  • $\Lambda = \mbox{Diag}(\lambda_1, \dots, \lambda_d)$ with $\lambda_1 \geq \lambda_2 \geq \dots \geq 0$, i.e., $\Lambda$ is the diagonal matrix of eigenvalues in decreasing order (all nonnegative because $\mathbf{Q}$ is positive semidefinite)

Substituting the standard SVD of $\mathbf{X}$, i.e., $\mathbf{X} = \mathbf{USV}^T$, gives

\[\mathbf{Q} = \frac{1}{n} \mathbf{V} \mathbf{S}^T \mathbf{U}^T \mathbf{USV}^T = \mathbf{V} (\frac{1}{n}\mathbf{S}^T\mathbf{S})\mathbf{V}^T\]

where:

  • $\mathbf{VV}^T = \mathbf{I}_d = \mathbf{V}^T \mathbf{V}$

  • $\dfrac{1}{n} \mathbf{S}^T \mathbf{S} = \mbox{Diag}(\dfrac{s_1^2}{n}, \dots, \dfrac{s_d^2}{n})$ (with the convention $s_j \triangleq 0$ for $j > \min\{n, d\}$)

  • $\dfrac{s_1^2}{n} \geq \dfrac{s_2^2}{n} \geq … \geq 0$

Therefore, the $\mathbf{V}$ and $\{s_j\}$ from the SVD of $\mathbf{X}$ suffice to compute PCA:

  • $\lambda_j = \dfrac{s_j^2}{n}$

  • the $\mathbf{V}$ here is exactly the $\mathbf{V}$ of PCA
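
A NumPy check that the SVD route reproduces the eigen-decomposition route (synthetic centered data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 300, 4
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))
X -= X.mean(axis=0)
Q = X.T @ X / n

U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T                                    # columns are right singular vectors

# lambda_j = s_j^2 / n: compare with the eigenvalues of Q, sorted descending.
assert np.allclose(np.sort(np.linalg.eigvalsh(Q))[::-1], s**2 / n)

# The columns of V are eigenvectors of Q: Q v_j = (s_j^2 / n) v_j.
assert np.allclose(Q @ V, V * (s**2 / n))
```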

Computing PCA via the Economy SVD

  • The economy SVD computes only the top $r$ singular vectors, i.e., $\mathbf{U}_r$ and $\mathbf{V}_r$, where $r = \mbox{rank}(\mathbf{X}) \leq \min(n, d)$

    It should therefore be cheaper to compute than the standard SVD (as the name suggests).

  • For PCA, only the top $R$ singular vectors $\mathbf{V}_R$ are needed, where $R < r$ and usually $R \ll r$

    Therefore, computing PCA via the economy SVD is more efficient:

\[\begin{aligned} (\mathbf{U}_r, \mathbf{S}_r, \mathbf{V}_r^T) &= \mbox{SVD}_{\mbox{economy}}(\mathbf{X}) \\ \mathbf{V}_R &= \mbox{first}~R~\mbox{columns of}~\mathbf{V}_r \\ \mathbf{z}_i &= \mathbf{V}_R^T\mathbf{x}_i~\forall i=1, ..., n \end{aligned}\]
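
A sketch of this pipeline; since `np.linalg.svd` does not stop at the top $R$ singular vectors, the truncation below is done manually (a truncated solver such as `scipy.sparse.linalg.svds` could avoid computing the rest, but that is an assumption about tooling, not something stated above):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, R = 500, 20, 3
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))
X -= X.mean(axis=0)

# Reduced SVD, then keep only the first R right singular vectors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V_R = Vt[:R].T                              # d x R, first R columns of V
Z = X @ V_R                                 # z_i = V_R^T x_i, one row per sample
```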

Standardizing the PCA Coefficients

The PCA coefficients $\{\mathbf{z}_i\}$ can be used for classification or regression (assuming that labels are now available).

If so, we should first standardize the new features $\{\mathbf{z}_i\}$, i.e., we want:

  1. $\overline{\mathbf{z}} \triangleq \dfrac{1}{n} \sum_{i=1}^n \mathbf{z}_i = \mathbf{0}$

  2. $\dfrac{1}{n} \sum_{i=1}^n z_{ij}^2 = 1, \forall j=1, \dots, R$, or equivalently, every diagonal entry of $\dfrac{1}{n} \sum_{i=1}^n \mathbf{z}_i \mathbf{z}_i^T$ equals 1

Define $\mathbf{Q}_z \triangleq \dfrac{1}{n} \sum_{i=1}^n \mathbf{z}_i \mathbf{z}_i^T$. Let us now check the values of $\overline{\mathbf{z}}$ and $\mbox{Diag}(\mathbf{Q}_z)$:

  • $\overline{\mathbf{z}} = \dfrac{1}{n} \sum_{i=1}^n \mathbf{V}_R^T \mathbf{x}_i = \mathbf{V}_R^T (\dfrac{1}{n} \sum_{i=1}^n \mathbf{x}_i) = \mathbf{V}_R^T \mathbf{0} = \mathbf{0}$, because $\{\mathbf{x}_i\}$ are already centered

  • $\mathbf{Q}_z = \dfrac{1}{n} \sum_{i=1}^n \mathbf{z}_i \mathbf{z}_i^T = \mathbf{V}_R^T (\dfrac{1}{n} \sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^T) \mathbf{V}_R = \mathbf{V}_R^T \mathbf{Q} \mathbf{V}_R = \dfrac{1}{n} \mathbf{V}_R^T \mathbf{V}_r \mathbf{S}_r^2 \mathbf{V}_r^T \mathbf{V}_R = \mbox{Diag}(\dfrac{s_1^2}{n}, \dots, \dfrac{s_R^2}{n})$, because $\mathbf{V}_R^T\mathbf{V}_r = [\mathbf{I}_R~\mathbf{0}_{R\times (r-R)}]$

In other words, the standardized coefficients $\{\tilde{\mathbf{z}}_i\}$ should be

\[\tilde{\mathbf{z}}_i \triangleq \mbox{Diag}(\frac{\sqrt n}{s_1}, ..., \frac{\sqrt n}{s_R}) \mathbf{z}_i\]
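
A NumPy sketch of this standardization on synthetic data (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, R = 400, 10, 3
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))
X -= X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt[:R].T                            # PCA coefficients, one row per sample

# z_tilde_i = Diag(sqrt(n)/s_1, ..., sqrt(n)/s_R) z_i
Z_tilde = Z * (np.sqrt(n) / s[:R])

assert np.allclose(Z_tilde.mean(axis=0), 0)         # mean zero
assert np.allclose((Z_tilde**2).mean(axis=0), 1)    # unit variance per coefficient
```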