PCA

Posted by AH on April 19, 2022


Data Representation

  • Dataset: $\{\mathbf{x}_i\}_{i=1}^n$

    • Each sample has $d$ features: $\mathbf{x}_i = [x_{i1}, \dots, x_{id}]^T \in \mathbb{R}^d$

    • The dataset can be written as the matrix $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_n]^T \in \mathbb{R}^{n\times d}$

    • We assume the data are centered, i.e., the mean is 0

  • Note: there are no labels here, so this is unsupervised learning

Main Ideas

  • Idea 1: linearly approximate each feature vector $\mathbf{x}_i \in \mathbb{R}^d$ with a low-dimensional vector $\mathbf{z}_i \in \mathbb{R}^R$

    \[\mathbf{x}_i \approx \mathbf{B} \mathbf{z}_i, \quad i=1, ..., n\]

    where:

    • the columns of $\mathbf{B} \in \mathbb{R}^{d\times R}$ form a dictionary of $R$ elements

    • $R$ is the rank of the approximation, with $1\leq R < d$ and ideally $R \ll d$

  • Idea 2: the approximation minimizes the residual sum of squares (RSS)

    \[(\hat{\mathbf{B}}, \{\hat{\mathbf{z}}_i\}_{i=1}^n) = \arg \min_{\mathbf{B}, \{\mathbf{z}_i\}} \{\sum_{i=1}^n \lVert \mathbf{x}_i - \mathbf{Bz}_i \rVert^2 \}\]

    Note: RSS is used here only for simplicity; it may not give the best results

Solution

  • The optimal $R$-element approximation dictionary is
\[\hat{\mathbf{B}} = \mathbf{V}_R \triangleq [\mathbf{v}_1, ..., \mathbf{v}_R]\]

where $\{\mathbf{v}_r\}_{r=1}^R$ are the eigenvectors of the sample covariance matrix $\mathbf{Q}$ corresponding to its $R$ largest eigenvalues $\{\lambda_r\}_{r=1}^R$:

\[\mathbf{Q} \triangleq \frac{1}{n} \sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^T = \sum_{j=1}^d \lambda_j \mathbf{v}_j \mathbf{v}_j^T\]
  • The optimal coefficients are
\[\hat{\mathbf{z}}_i = \mathbf{V}_R^T \mathbf{x}_i\]
  • PCA projects $\mathbf{x}_i \in \mathbb{R}^d$ onto the subspace spanned by $\{\mathbf{v}_r\}_{r=1}^R$; these vectors are also called the $R$ principal components
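
As a concrete reference, here is a minimal NumPy sketch of this recipe; the synthetic data `X` and the names `Q`, `V_R`, `Z` are illustrative choices, not part of the original notes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, R = 200, 5, 2

# Centered data matrix X (n x d), one sample x_i^T per row.
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))
X -= X.mean(axis=0)

# Sample covariance Q = (1/n) * sum_i x_i x_i^T = (1/n) X^T X.
Q = X.T @ X / n

# Eigen-decomposition; np.linalg.eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(Q)
order = np.argsort(eigvals)[::-1]          # sort descending
lam, V = eigvals[order], eigvecs[:, order]

V_R = V[:, :R]                              # dictionary B-hat = V_R (d x R)
Z = X @ V_R                                 # coefficients z_i = V_R^T x_i, one row each
X_hat = Z @ V_R.T                           # rank-R approximations B z_i
```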

Orthogonal Projection

Denote the subspace spanned by the columns of $\mathbf{B}$ (i.e., all linear combinations of its columns) by $\mbox{colsp}(\mathbf{B}) \triangleq \{ \mathbf{Bz} \quad s.t.~\mathbf{z} \in \mathbb{R}^R \} \subset \mathbb{R}^d$

Projecting a vector $\mathbf{x}\in\mathbb{R}^d$ onto $\mbox{colsp}(\mathbf{B})$ decomposes it as

\[\mathbf{x} = \hat{\mathbf{x}} + \mathbf{e}\]

where $\hat{\mathbf{x}} \in \mbox{colsp}(\mathbf{B})$ and $\mathbf{e} \perp \mbox{colsp}(\mathbf{B})$

Both components can be computed as

\[\hat{\mathbf{x}} = \mathbf{P_B x},~ \mathbf{e} = \mathbf{P_B^\perp x}\]

where $\mathbf{P}_\mathbf{B} = \mathbf{B}(\mathbf{B}^T\mathbf{B})^{-1} \mathbf{B}^T$ and $\mathbf{P}_\mathbf{B}^\perp = \mathbf{I} - \mathbf{P}_\mathbf{B}$

$\mathbf{P}_\mathbf{B}$ and $\mathbf{P}_\mathbf{B}^\perp$ are called projection operators; they satisfy $\mathbf{P}_\mathbf{B} = \mathbf{P}_\mathbf{B}^T$ and $\mathbf{P}_\mathbf{B} = \mathbf{P}_\mathbf{B}^2$
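
A quick NumPy check of these identities, using an arbitrary full-column-rank `B` chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d, R = 6, 3
B = rng.normal(size=(d, R))                 # any full-column-rank B

# Projection onto colsp(B) and onto its orthogonal complement.
P_B = B @ np.linalg.inv(B.T @ B) @ B.T
P_perp = np.eye(d) - P_B

x = rng.normal(size=d)
x_hat, e = P_B @ x, P_perp @ x

assert np.allclose(x, x_hat + e)            # x = x_hat + e
assert np.allclose(B.T @ e, 0)              # e is orthogonal to colsp(B)
assert np.allclose(P_B, P_B.T)              # P_B = P_B^T
assert np.allclose(P_B, P_B @ P_B)          # P_B = P_B^2
```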

Derivation

  1. First fix $\mathbf{B}$ and solve for $\hat{\mathbf{z}}_i$ (ordinary least squares)
\[\hat{\mathbf{z}}_i = \arg \min_{\mathbf{z}_i} \lVert \mathbf{x}_i - \mathbf{Bz}_i \rVert^2 = (\mathbf{B}^T\mathbf{B})^{-1} \mathbf{B}^T \mathbf{x}_i\]
  2. Substitute $\{\hat{\mathbf{z}}_i\}_{i=1}^n$ back into the original problem to obtain
\[\begin{aligned} \hat{\mathbf{B}} &= \arg \min_{\mathbf{B}} \{\sum_{i=1}^n \lVert \mathbf{x}_i - \mathbf{B}(\mathbf{B}^T\mathbf{B})^{-1} \mathbf{B}^T \mathbf{x}_i \rVert^2\} \\ &= \arg \min_{\mathbf{B}} \{\sum_{i=1}^n \lVert (\mathbf{I} - \mathbf{B}(\mathbf{B}^T\mathbf{B})^{-1} \mathbf{B}^T) \mathbf{x}_i \rVert^2\} \\ &= \arg \min_{\mathbf{B}} \sum_{i=1}^n \{\lVert \mathbf{P_B^\perp} \mathbf{x}_i \rVert^2\} \\ &= \arg \min_{\mathbf{B}} \sum_{i=1}^n \{\lVert \mathbf{e}_i \rVert^2\} \end{aligned}\]

That is, PCA chooses the $\mathbf{B}$ that minimizes the sum of squared projection errors.

Going further, we can define the cost above as

\[\begin{aligned} J(\mathbf{B}) &\triangleq \sum_i \{\lVert \mathbf{P_B^\perp} \mathbf{x}_i \rVert^2\} \\ &= \sum_i \mathbf{x}_i^T (\mathbf{P_B^\perp})^T \mathbf{P_B^\perp} \mathbf{x}_i \\ &= \sum_i \mathbf{x}_i^T \mathbf{P_B^\perp} \mathbf{x}_i \\ &= \sum_i \mbox{tr}(\mathbf{x}_i^T \mathbf{P_B^\perp} \mathbf{x}_i) \\ &= \sum_i \mbox{tr}(\mathbf{P_B^\perp} \mathbf{x}_i \mathbf{x}_i^T) \\ &= \mbox{tr}(\mathbf{P_B^\perp} \sum_i \mathbf{x}_i \mathbf{x}_i^T) \\ &= n \mbox{tr}(\mathbf{P_B^\perp}~\frac{1}{n} \sum_i \mathbf{x}_i \mathbf{x}_i^T) \\ &= n \mbox{tr}(\mathbf{P_B^\perp}~\mathbf{Q}) \end{aligned}\]

It can be shown that $\mathbf{Q}$ is positive semidefinite, so it admits an eigen-decomposition:

\[\mathbf{Q} = \mathbf{V\Lambda V^T}\]

where:

  • $\mathbf{V}$ is an orthogonal matrix, i.e., $\mathbf{VV}^T = \mathbf{I}_d = \mathbf{V}^T \mathbf{V}$
  • $\Lambda = \mbox{Diag}(\lambda_1, \dots, \lambda_d)$ with $\lambda_j \geq 0~\forall j$, i.e., $\Lambda$ is the diagonal matrix of eigenvalues (all nonnegative because $\mathbf{Q}$ is positive semidefinite)

Furthermore, we may assume without loss of generality that the $\{\lambda_j\}$ are sorted in decreasing order.
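
As a sanity check on the trace identity derived above, here is a small NumPy sketch with arbitrary synthetic `X` and `B`:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, R = 100, 5, 2
X = rng.normal(size=(n, d))
X -= X.mean(axis=0)
B = rng.normal(size=(d, R))

Q = X.T @ X / n
P_perp = np.eye(d) - B @ np.linalg.inv(B.T @ B) @ B.T

# Direct sum of squared projection errors vs. the trace form n * tr(P_B^perp Q).
J_direct = np.sum((X @ P_perp.T) ** 2)      # sum_i ||P_B^perp x_i||^2
J_trace = n * np.trace(P_perp @ Q)
assert np.allclose(J_direct, J_trace)
```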

Theorem (Eckart-Young, 1936)

The optimal $\mathbf{B}\in \mathbb{R}^{d\times R}$ consists of the $R$ principal eigenvectors of $\mathbf{Q}$:

\[\hat{\mathbf{B}} = \arg \min_\mathbf{B} n \mbox{tr}(\mathbf{P_B^\perp}\mathbf{Q}) = [\mathbf{v}_1, ..., \mathbf{v}_R] \triangleq \mathbf{V}_R\]

More precisely, the optimal $\hat{\mathbf{B}}$ is any $\mathbf{B}$ satisfying $\mbox{colsp}(\mathbf{B}) = \mbox{colsp}(\mathbf{V}_R)$.
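
A numerical illustration of the theorem (synthetic data; the random competitor `B_rand` is only for comparison):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, R = 100, 5, 2
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))
X -= X.mean(axis=0)
Q = X.T @ X / n

def cost(B):
    """n * tr(P_B^perp Q), the summed squared projection error."""
    P_B = B @ np.linalg.inv(B.T @ B) @ B.T
    return n * np.trace((np.eye(d) - P_B) @ Q)

eigvals, V = np.linalg.eigh(Q)
V_R = V[:, np.argsort(eigvals)[::-1][:R]]   # top-R eigenvectors of Q

B_rand = rng.normal(size=(d, R))            # arbitrary competitor of the same size
assert cost(V_R) <= cost(B_rand) + 1e-9     # V_R achieves the minimum cost
```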

Performance of PCA

Given the rank $R$, how should we evaluate the performance of PCA? (This also helps us choose a suitable value of $R$.)

Recall: in linear regression we have $R^2 \triangleq 1 - \dfrac{\mbox{RSS}}{ns_y^2}$

For PCA, we instead use the proportion of variance (PoV):

\[\mbox{PoV} \triangleq 1 - \frac{\mbox{RSS}}{\sum_{i=1}^n \lVert \mathbf{x}_i \rVert^2}\]

where:

  • $\sum_{i=1}^n \lVert \mathbf{x}_i \rVert^2 = \sum_{i=1}^n \mbox{tr}(\mathbf{x}_i \mathbf{x}_i^T) = n\mbox{tr}(\mathbf{Q}) = n\mbox{tr}(\mathbf{V\Lambda V^T}) = n\sum_{j=1}^d \lambda_j$
  • $\mbox{RSS} = n\mbox{tr}(\mathbf{Q}) - n\mbox{tr}(\mathbf{P}_{\mathbf{V}_R} \mathbf{Q}) = n\sum_{j=1}^d \lambda_j - n\sum_{j=1}^R \lambda_j = n\sum_{j={R+1}}^d \lambda_j$

Therefore, the $\mbox{PoV}$ of the first $R$ principal components is

\[\mbox{PoV}(R) = \frac{n \sum_{j=1}^d \lambda_j}{n \sum_{j=1}^d \lambda_j} - \frac{n \sum_{j=R+1}^d \lambda_j}{n \sum_{j=1}^d \lambda_j} = \frac{\sum_{j=1}^R \lambda_j}{\sum_{j=1}^d \lambda_j}\]

The closer $\mbox{PoV}$ is to 1, the better.
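
In code, $\mbox{PoV}(R)$ can be read directly off the eigenvalue spectrum; a minimal sketch with illustrative eigenvalues:

```python
import numpy as np

# Eigenvalues of Q, assumed sorted in decreasing order (illustrative values).
lam = np.array([4.0, 2.5, 1.0, 0.3, 0.2])

pov = np.cumsum(lam) / lam.sum()            # PoV(R) for R = 1, ..., d
print(pov)                                  # [0.5, 0.8125, 0.9375, 0.975, 1.0]

# For example, pick the smallest R with PoV(R) >= 0.9.
R = int(np.searchsorted(pov, 0.9) + 1)      # R = 3 here
```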

Computing PCA via the SVD

SVD - Singular Value Decomposition

Given any matrix $\mathbf{X}\in \mathbb{R}^{n\times d}$, let $r \triangleq \mbox{rank}(\mathbf{X})$

  • Standard SVD:
\[\mathbf{X} = \mathbf{USV}^T, \mbox{where}~\Bigg\{ \begin{aligned} &\mathbf{U} \in \mathbb{R}^{n\times n}~\mbox{obeys}~\mathbf{UU}^T = \mathbf{I}_n = \mathbf{U}^T\mathbf{U} \\ &\mathbf{V} \in \mathbb{R}^{d\times d}~\mbox{obeys}~\mathbf{VV}^T = \mathbf{I}_d = \mathbf{V}^T\mathbf{V} \\ &\mathbf{S} \in \mathbb{R}^{n\times d}~\mbox{obeys}~\mathbf{S} = \mbox{Diag}(s_1, ..., s_m)~\mbox{where}~m=\min\{n, d\}~\mbox{and}~s_1 \geq s_2 \geq ... \geq 0 \\ \end{aligned}\]
  • Economy SVD:
\[\mathbf{X} = \mathbf{U}_r\mathbf{S}_r\mathbf{V}_r^T, \mbox{where}~\Bigg\{ \begin{aligned} &\mathbf{U}_r \in \mathbb{R}^{n\times r}~\mbox{obeys}~\mathbf{U}_r^T\mathbf{U}_r = \mathbf{I}_r \\ &\mathbf{V}_r \in \mathbb{R}^{d\times r}~\mbox{obeys}~\mathbf{V}_r^T\mathbf{V}_r = \mathbf{I}_r \\ &\mathbf{S}_r \in \mathbb{R}^{r\times r}~\mbox{obeys}~\mathbf{S}_r = \mbox{Diag}(s_1, ..., s_r)~\mbox{where}~r\leq\min\{n, d\}~\mbox{and}~s_1 \geq s_2 \geq ... \geq s_r > 0 \\ \end{aligned}\]
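
In NumPy, for example, `np.linalg.svd` exposes both forms via the `full_matrices` flag; note that its reduced form keeps $m=\min\{n,d\}$ components rather than truncating at $r=\mbox{rank}(\mathbf{X})$ (an implementation detail, not part of the definition above):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 8, 5
X = rng.normal(size=(n, d))

# Standard SVD: U is n x n, Vt is d x d, s holds min(n, d) singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=True)
print(U.shape, s.shape, Vt.shape)           # (8, 8) (5,) (5, 5)

# Reduced ("thin") SVD: U is n x m with m = min(n, d).  Truncating further at
# r = rank(X) would additionally drop the zero singular values.
U_m, s_m, Vt_m = np.linalg.svd(X, full_matrices=False)
print(U_m.shape, s_m.shape, Vt_m.shape)     # (8, 5) (5,) (5, 5)
```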

Computing PCA via the Standard SVD

As before, assume $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_n]^T \in \mathbb{R}^{n\times d}$ is centered (i.e., mean = 0)

For PCA, the eigen-decomposition of the covariance matrix $\mathbf{Q}$ (with eigenvalues sorted in decreasing order) is

\(\mathbf{Q} = \frac{1}{n} \mathbf{X}^T\mathbf{X} = \mathbf{V\Lambda V^T}\), where:

  • $\mathbf{V}$ is an orthogonal matrix, i.e., $\mathbf{VV}^T = \mathbf{I}_d = \mathbf{V}^T \mathbf{V}$

  • $\Lambda = \mbox{Diag}(\lambda_1, \dots, \lambda_d)$ with $\lambda_1 \geq \lambda_2 \geq \dots \geq 0$, i.e., $\Lambda$ is the diagonal matrix of eigenvalues in decreasing order (all nonnegative because $\mathbf{Q}$ is positive semidefinite)

Substituting the standard SVD of $\mathbf{X}$, i.e., $\mathbf{X} = \mathbf{USV}^T$, gives

\[\mathbf{Q} = \frac{1}{n} \mathbf{V} \mathbf{S}^T \mathbf{U}^T \mathbf{USV}^T = \mathbf{V} (\frac{1}{n}\mathbf{S}^T\mathbf{S})\mathbf{V}^T\]

where:

  • $\mathbf{VV}^T = \mathbf{I}_d = \mathbf{V}^T \mathbf{V}$

  • $\dfrac{1}{n} \mathbf{S}^T \mathbf{S} = \mbox{Diag}(\dfrac{s_1^2}{n}, \dots, \dfrac{s_d^2}{n})$ (with the convention $s_j \triangleq 0$ for $j > \min\{n, d\}$)

  • $\dfrac{s_1^2}{n} \geq \dfrac{s_2^2}{n} \geq … \geq 0$

Therefore, the $\mathbf{V}$ and $\{s_j\}$ from the SVD of $\mathbf{X}$ suffice to compute PCA:

  • $\lambda_j = \dfrac{s_j^2}{n}$

  • the $\mathbf{V}$ here is exactly the $\mathbf{V}$ of PCA
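
A NumPy check that the SVD route reproduces the eigen-decomposition route (synthetic centered data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 300, 4
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))
X -= X.mean(axis=0)
Q = X.T @ X / n

U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T                                    # columns are right singular vectors

# lambda_j = s_j^2 / n: compare with the eigenvalues of Q, sorted descending.
assert np.allclose(np.sort(np.linalg.eigvalsh(Q))[::-1], s**2 / n)

# The columns of V are eigenvectors of Q: Q v_j = (s_j^2 / n) v_j.
assert np.allclose(Q @ V, V * (s**2 / n))
```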

Computing PCA via the Economy SVD

  • The economy SVD computes only the top $r$ singular vectors, i.e., $\mathbf{U}_r$ and $\mathbf{V}_r$, where $r = \mbox{rank}(\mathbf{X}) \leq \min(n, d)$

    It should therefore be cheaper to compute than the standard SVD (as the name suggests).

  • For PCA, only the top $R$ singular vectors $\mathbf{V}_R$ are needed, where $R < r$ and usually $R \ll r$

    Therefore, computing PCA via the economy SVD is more efficient:

\[\begin{aligned} (\mathbf{U}_r, \mathbf{S}_r, \mathbf{V}_r^T) &= \mbox{SVD}_{\mbox{economy}}(\mathbf{X}) \\ \mathbf{V}_R &= \mbox{first}~R~\mbox{columns of}~\mathbf{V}_r \\ \mathbf{z}_i &= \mathbf{V}_R^T\mathbf{x}_i~\forall i=1, ..., n \end{aligned}\]
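
A sketch of this pipeline; since `np.linalg.svd` does not stop at the top $R$ singular vectors, the truncation below is done manually (a truncated solver such as `scipy.sparse.linalg.svds` could avoid computing the rest, but that is an assumption about tooling, not something stated above):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, R = 500, 20, 3
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))
X -= X.mean(axis=0)

# Reduced SVD, then keep only the first R right singular vectors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V_R = Vt[:R].T                              # d x R, first R columns of V
Z = X @ V_R                                 # z_i = V_R^T x_i, one row per sample
```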

Standardizing the PCA Coefficients

The PCA coefficients $\{\mathbf{z}_i\}$ can be used for classification or regression (assuming that labels are now available).

If so, we should first standardize the new features $\{\mathbf{z}_i\}$, i.e., we want:

  1. $\overline{\mathbf{z}} \triangleq \dfrac{1}{n} \sum_{i=1}^n \mathbf{z}_i = \mathbf{0}$

  2. $\dfrac{1}{n} \sum_{i=1}^n z_{ij}^2 = 1, \forall j=1, \dots, R$, or equivalently, every diagonal entry of $\dfrac{1}{n} \sum_{i=1}^n \mathbf{z}_i \mathbf{z}_i^T$ equals 1

Define $\mathbf{Q}_z \triangleq \dfrac{1}{n} \sum_{i=1}^n \mathbf{z}_i \mathbf{z}_i^T$. Let us now check the values of $\overline{\mathbf{z}}$ and $\mbox{Diag}(\mathbf{Q}_z)$:

  • $\overline{\mathbf{z}} = \dfrac{1}{n} \sum_{i=1}^n \mathbf{V}_R^T \mathbf{x}_i = \mathbf{V}_R^T (\dfrac{1}{n} \sum_{i=1}^n \mathbf{x}_i) = \mathbf{V}_R^T \mathbf{0} = \mathbf{0}$, because $\{\mathbf{x}_i\}$ are already centered

  • $\mathbf{Q}_z = \dfrac{1}{n} \sum_{i=1}^n \mathbf{z}_i \mathbf{z}_i^T = \mathbf{V}_R^T (\dfrac{1}{n} \sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^T) \mathbf{V}_R = \mathbf{V}_R^T \mathbf{Q} \mathbf{V}_R = \dfrac{1}{n} \mathbf{V}_R^T \mathbf{V}_r \mathbf{S}_r^2 \mathbf{V}_r^T \mathbf{V}_R = \mbox{Diag}(\dfrac{s_1^2}{n}, \dots, \dfrac{s_R^2}{n})$, because $\mathbf{V}_R^T\mathbf{V}_r = [\mathbf{I}_R~\mathbf{0}_{R\times (r-R)}]$

In other words, the standardized coefficients $\{\tilde{\mathbf{z}}_i\}$ should be

\[\tilde{\mathbf{z}}_i \triangleq \mbox{Diag}(\frac{\sqrt n}{s_1}, ..., \frac{\sqrt n}{s_R}) \mathbf{z}_i\]
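
A NumPy sketch of this standardization on synthetic data (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, R = 400, 10, 3
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))
X -= X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt[:R].T                            # PCA coefficients, one row per sample

# z_tilde_i = Diag(sqrt(n)/s_1, ..., sqrt(n)/s_R) z_i
Z_tilde = Z * (np.sqrt(n) / s[:R])

assert np.allclose(Z_tilde.mean(axis=0), 0)         # mean zero
assert np.allclose((Z_tilde**2).mean(axis=0), 1)    # unit variance per coefficient
```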