Linear Regression

Posted by AH on April 13, 2022

Statistics Review

  • Sample mean
\[\overline{x} \triangleq \frac{1}{n} \sum_{i=1}^n x_i,~\overline{y} \triangleq \frac{1}{n} \sum_{i=1}^n y_i\]
  • Sample variance
\[s_x^2 \triangleq \frac{1}{n} \sum_{i=1}^n (x_i - \overline{x})^2, ~s_y^2 \triangleq \frac{1}{n} \sum_{i=1}^n (y_i - \overline{y})^2\]

(Unbiased estimate: replace the $\frac{1}{n}$ above with $\frac{1}{n-1}$.)

Equivalently:

\[s_x^2 = \frac{1}{n} \sum_{i=1}^n x_i^2 - 2\,\overline{x}\cdot\frac{1}{n}\sum_{i=1}^n x_i + \overline{x}^2 = \frac{1}{n} \sum_{i=1}^n x_i^2 - \overline{x}^2 \\ \Rightarrow \frac{1}{n} \sum_{i=1}^n x_i^2 = s_x^2+\overline{x}^2\]
  • Sample standard deviation (SD)
\[s_x = \sqrt{s_x^2}, ~s_y = \sqrt{s_y^2}\]
  • Sample covariance: measures the degree to which $x_i$ and $y_i$ vary together
\[s_{xy} \triangleq \frac{1}{n} \sum_{i=1}^n (x_i-\overline{x})(y_i-\overline{y})\]

Equivalently:

\[s_{xy} = \frac{1}{n} \sum_{i=1}^n x_iy_i - \overline{x}\,\overline{y} \\ \Rightarrow \frac{1}{n}\sum_{i=1}^n x_iy_i = s_{xy} + \overline{x}\,\overline{y}\]
  • Sample (Pearson) correlation coefficient
\[\rho_{xy} \triangleq \frac{s_{xy}}{s_x s_y} \in [-1, 1]\]
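All of these statistics are one-liners in numpy. Below is a minimal sketch (the data is made up for illustration) that also verifies the two shortcut identities above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()          # sample means
s2_x = ((x - x_bar) ** 2).mean()           # sample variance (1/n version)
s2_y = ((y - y_bar) ** 2).mean()
s_xy = ((x - x_bar) * (y - y_bar)).mean()  # sample covariance (1/n version)
rho = s_xy / np.sqrt(s2_x * s2_y)          # Pearson correlation

# Shortcut identities:
#   (1/n) sum x_i^2  = s_x^2 + x_bar^2
#   (1/n) sum x_i y_i = s_xy + x_bar * y_bar
assert np.isclose((x ** 2).mean(), s2_x + x_bar ** 2)
assert np.isclose((x * y).mean(), s_xy + x_bar * y_bar)
```

Note that `np.var` uses the $\frac{1}{n}$ normalization by default (pass `ddof=1` for the unbiased $\frac{1}{n-1}$ version), while `np.cov` defaults to $\frac{1}{n-1}$.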

Simple Linear Regression

Model

The model ($x$ is a scalar):

\[y \approx \beta_0 + \beta_1 x \triangleq \hat{y}\]

where $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\hat{y}$ is the predicted label (prediction).

We call $\boldsymbol{\beta} = [\beta_0, \beta_1]^T$ the parameters of the model, also known as coefficients or weights.

Residual

However, $x$ does not predict $y$ exactly; $y \approx \beta_0+\beta_1 x$ holds only approximately. To make the relation exact, we introduce a residual $\epsilon$:

\[y = \beta_0+\beta_1 x + \epsilon\]

So for the $i$-th sample, we have

  • Predicted value: $\hat{y}_i = \beta_0 + \beta_1 x_i$
  • Residual: $\epsilon_i = y_i - \hat{y}_i$

In the figure below, the residual $\epsilon_i$ is the vertical distance from $y_i$ to the red fitted line.

Least-Squares Fit

We choose the model parameters $\boldsymbol{\beta} = [\beta_0, \beta_1]^T$ by minimizing the residual sum of squares (RSS):

\[RSS(\beta_0, \beta_1) \triangleq \sum_{i=1}^n \epsilon_i^2 = \sum_{i=1}^n (y_i-\hat{y}_i)^2\]

where $\epsilon_i$ and $\hat{y}_i$ are functions of $[\beta_0, \beta_1]$.

Geometrically, minimizing RSS $\Leftrightarrow$ minimizing the sum of squared vertical distances between the sample points and the fitted line.
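As a concrete illustration (the data and candidate parameters are made up), the predictions, residuals, and RSS can be computed directly from the definitions:

```python
import numpy as np

def rss(beta0: float, beta1: float, x: np.ndarray, y: np.ndarray) -> float:
    """Residual sum of squares for the line y_hat = beta0 + beta1 * x."""
    y_hat = beta0 + beta1 * x   # predicted values
    eps = y - y_hat             # residuals
    return float(np.sum(eps ** 2))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
print(rss(0.0, 2.0, x, y))   # RSS of one candidate line
print(rss(0.1, 1.9, x, y))   # a different candidate, for comparison
```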

Minimizing RSS

RSS can be minimized by setting its partial derivatives to zero:

\[0 = \frac{\partial RSS(\beta_0, \beta_1)}{\partial \beta_0} = \frac{\partial}{\partial \beta_0} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 = -2\sum_{i=1}^n (y_i-\beta_0-\beta_1x_i) \qquad(1)\\ 0 = \frac{\partial RSS(\beta_0, \beta_1)}{\partial \beta_1} = \frac{\partial}{\partial \beta_1} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 = -2\sum_{i=1}^n x_i(y_i-\beta_0-\beta_1x_i) \quad(2)\]

From (1): $\beta_0 = \overline{y} - \beta_1 \overline{x} \quad (3)$

From (2): $0 = \dfrac{1}{n} \sum_{i=1}^n x_iy_i - \overline{x} \beta_0 - \beta_1 \dfrac{1}{n} \sum_{i=1}^n x_i^2 \quad (4)$

Substituting (3) into (4) gives $0=\dfrac{1}{n}\sum_{i=1}^n x_iy_i - \overline{x}\,\overline{y} - \beta_1\left(\dfrac{1}{n} \sum_{i=1}^n x_i^2 - \overline{x}^2\right) = s_{xy} - \beta_1 s_x^2 \Leftrightarrow \beta_1 = \dfrac{s_{xy}}{s_x^2} \quad (5)$
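A quick numerical check of (3) and (5) against numpy's built-in degree-1 polynomial fit, which returns the slope first and the intercept second (same made-up data as above):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()
s2_x = ((x - x_bar) ** 2).mean()
s_xy = ((x - x_bar) * (y - y_bar)).mean()

beta1 = s_xy / s2_x            # equation (5)
beta0 = y_bar - beta1 * x_bar  # equation (3)

slope, intercept = np.polyfit(x, y, deg=1)
assert np.isclose(beta1, slope) and np.isclose(beta0, intercept)
```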

Substituting (3) and (5) back into RSS yields

\[RSS(\beta_0, \beta_1) = n\left(1-\frac{s_{xy}^2}{s_x^2 s_y^2}\right)s_y^2 = n(1-\rho_{xy}^2)s_y^2\]
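To verify this, use (3) to rewrite each residual as $(y_i-\overline{y}) - \beta_1(x_i-\overline{x})$, expand the square, and apply (5):

\[RSS = \sum_{i=1}^n \big((y_i-\overline{y}) - \beta_1(x_i-\overline{x})\big)^2 = n\left(s_y^2 - 2\beta_1 s_{xy} + \beta_1^2 s_x^2\right) \overset{(5)}{=} n\left(s_y^2 - \frac{s_{xy}^2}{s_x^2}\right) = n(1-\rho_{xy}^2)s_y^2\]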

$R^2$ Goodness-of-Fit

Define

\[R^2 \triangleq 1 - \frac{RSS/n}{s_y^2}\]

$R^2$ is also called the coefficient of determination.

  • $R^2=1$: perfect predictor ($\hat{y}_i = y_i$)
  • $R^2=0$: no better than predicting the mean ($\hat{y}_i = \overline{y}$)
  • $R^2<0$: worse than predicting the mean

With the least-squares coefficients, $R^2 = \rho_{xy}^2$.

Also, since the least-squares slope is $\beta_1 = \dfrac{\rho_{xy}s_y}{s_x}$, we have $\operatorname{sgn}(\beta_1) = \operatorname{sgn}(\rho_{xy})$.
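A short numeric check of both claims, reusing the least-squares fit from above (same made-up data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

x_bar, y_bar = x.mean(), y.mean()
s2_x = ((x - x_bar) ** 2).mean()
s2_y = ((y - y_bar) ** 2).mean()
s_xy = ((x - x_bar) * (y - y_bar)).mean()
rho = s_xy / np.sqrt(s2_x * s2_y)

beta1 = s_xy / s2_x            # least-squares slope (5)
beta0 = y_bar - beta1 * x_bar  # least-squares intercept (3)
rss = np.sum((y - beta0 - beta1 * x) ** 2)

r2 = 1 - (rss / n) / s2_y          # R^2 from the definition
assert np.isclose(r2, rho ** 2)    # R^2 = rho_xy^2 under least squares
assert np.sign(beta1) == np.sign(rho)
```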

Multiple Linear Regression

Notation

  • $\mathbf{X}$: feature matrix, with one row per sample and one column per feature
\[\mathbf{X} = \begin{bmatrix} \mathbf{x}_1^T \\ \vdots \\ \mathbf{x}_n^T \end{bmatrix} = \begin{bmatrix} x_{11} & \cdots & x_{1d} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nd} \end{bmatrix} \in \mathbb{R}^{n \times d}\]
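A minimal numpy sketch of assembling $\mathbf{X}$ under this rows-as-samples convention; the leading column of ones, which absorbs the intercept $\beta_0$, is one common choice and is an assumption here:

```python
import numpy as np

# Made-up features for n = 3 samples, d = 2 features.
x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([0.5, 1.5, 2.5])

# Rows are samples, columns are features; the leading column of ones
# lets beta_0 be treated as just another coefficient.
X = np.column_stack([np.ones_like(x1), x1, x2])   # shape (n, d + 1)
print(X)
```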