# Linear Regression

# Model

Assumption: $y$ is approximately linear in $\mathbf x$.

Model: $\hat{y} = w_1 x_1 + \cdots + w_d x_d + b$

More compactly: $\hat{y} = \mathbf{w}^\top \mathbf{x} + b$

The bias $b$ can be absorbed into $\mathbf w$ by appending a constant 1 to $\mathbf x$.

For the whole dataset: ${\hat{\mathbf{y}}} = \mathbf{X} \mathbf{w} + b$

Likewise, $b$ can be absorbed into $\mathbf w$ by appending a column of ones to $\mathbf X$.
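The bias-absorption trick can be checked numerically. The following is a minimal NumPy sketch; the sizes, the random data, and the names `X_aug` and `w_aug` are illustrative assumptions, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3                      # hypothetical sizes, chosen for illustration
X = rng.normal(size=(n, d))      # design matrix, one example per row
w = rng.normal(size=d)           # weights
b = 0.5                          # bias

# Explicit bias: y_hat = X w + b
y_hat = X @ w + b

# Absorbed bias: append a column of ones to X and append b to w
X_aug = np.hstack([X, np.ones((n, 1))])
w_aug = np.append(w, b)
y_hat_aug = X_aug @ w_aug

# Both forms give the same predictions
print(np.allclose(y_hat, y_hat_aug))
```

The two forms are interchangeable; the augmented form is convenient for derivations because only one parameter vector has to be tracked.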

# Loss Function

Loss functions quantify the distance between the real and predicted values of the target. --d2l

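In other words, a loss function measures how far the predictions are from the true values.

It is non-negative; the closer it is to 0, the better the model fits the data.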

# Squared Error

For the $i$-th example, the squared error is:

$l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2$

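Effects:

- Larger gaps between the predicted and true values incur disproportionately larger losses.
- As a result, the loss is overly sensitive to outliers.

Loss function: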

$L(\mathbf{w}, b) =\frac{1}{n}\sum_{i=1}^n l^{(i)}(\mathbf{w}, b) =\frac{1}{n} \sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2$

Goal:

Find the $\mathbf w, b$ that minimize $L(\mathbf{w}, b)$, denoted $\mathbf{w}^*, b^* = \operatorname*{argmin}_{\mathbf{w}, b}\ L(\mathbf{w}, b)$
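To make the loss concrete, here is a small NumPy sketch of $L(\mathbf{w}, b)$ as defined above. The function name `squared_loss` and the toy numbers are assumptions made for illustration.

```python
import numpy as np

def squared_loss(X, y, w, b):
    """Average over examples of 0.5 * (y_hat - y)^2."""
    residual = X @ w + b - y
    return 0.5 * np.mean(residual ** 2)

# Toy data (hypothetical): the model predicts -1 for every row
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
w = np.array([1.0, -1.0])
b = 0.0

y_perfect = np.array([-1.0, -1.0, -1.0])   # matches the predictions exactly
y_outlier = np.array([-1.0, -1.0, 2.0])    # last target is off by 3

loss_perfect = squared_loss(X, y_perfect, w, b)
loss_outlier = squared_loss(X, y_outlier, w, b)
print(loss_perfect, loss_outlier)
```

A single residual of 3 contributes $\frac{1}{2} \cdot 3^2 = 4.5$ before averaging, nine times the contribution of a residual of 1, which is the outlier sensitivity noted above.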

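# Analytic Solution

To simplify, append a column of ones to $\mathbf X$ and absorb $b$ into $\mathbf w$. The problem then becomes minimizing $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$ (the squared $\ell_2$ norm, i.e. the sum of squared residuals, which equals the loss function up to the constant factor $\frac{1}{2n}$ and therefore has the same minimizer).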

As long as the design matrix $\mathbf X$ has full rank (no feature is linearly dependent on the others), then there will be just one critical point on the loss surface and it corresponds to the minimum of the loss over the entire domain. --d2l

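In short: if $\mathbf X$ has full rank, the loss surface has exactly one critical point, and it corresponds to the minimum loss.

Take the derivative with respect to $\mathbf w$ and set it to 0: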

$\partial_{\mathbf{w}} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 = 2 \mathbf{X}^\top (\mathbf{X} \mathbf{w} - \mathbf{y}) = 0 \textrm{ and hence } \mathbf{X}^\top \mathbf{y} = \mathbf{X}^\top \mathbf{X} \mathbf{w}.$

This gives: $\mathbf{w}^* = (\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf{y}$
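The closed form can be tried on synthetic data. This is a sketch under stated assumptions: the ground-truth values `true_w` and `true_b`, the noise level, and the data sizes are all made up for illustration. It solves the linear system $\mathbf{X}^\top \mathbf{X} \mathbf{w} = \mathbf{X}^\top \mathbf{y}$ rather than forming the inverse explicitly, which is usually the numerically safer route.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
true_w = np.array([2.0, -3.4, 1.0])   # hypothetical ground truth
true_b = 4.2
y = X @ true_w + true_b + 0.01 * rng.normal(size=n)

# Absorb the bias by appending a column of ones
X_aug = np.hstack([X, np.ones((n, 1))])

# Solve X^T X w = X^T y (the condition derived above)
w_star = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)

# Cross-check against NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

# The gradient 2 X^T (X w - y) should vanish at the solution
gradient = 2 * X_aug.T @ (X_aug @ w_star - y)
print(w_star)
```

The last entry of `w_star` plays the role of $b$, because the column of ones was appended last.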

# Gradient Descent

iteratively reducing the error by updating the parameters in the direction that incrementally lowers the loss function. --d2l

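Idea: move the model parameters in the direction of the negative gradient, i.e. the direction in which the loss decreases.

# Naive

Compute the derivative of the loss with respect to $\mathbf w$ on every example in the dataset, average them, and take a single gradient descent step.

This is slow: every single update requires a pass over the entire dataset.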
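A minimal sketch of the naive (full-batch) scheme, assuming noise-free synthetic data, a hand-picked learning rate `lr`, and a fixed number of steps; none of these choices come from the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 2
X = rng.normal(size=(n, d))
true_w, true_b = np.array([2.0, -3.4]), 4.2   # hypothetical ground truth
y = X @ true_w + true_b                        # noise-free targets

w, b = np.zeros(d), 0.0
lr = 0.1                                       # assumed learning rate
losses = []
for step in range(300):
    residual = X @ w + b - y                   # uses ALL n examples
    losses.append(0.5 * np.mean(residual ** 2))
    grad_w = X.T @ residual / n                # average gradient w.r.t. w
    grad_b = residual.mean()                   # average gradient w.r.t. b
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b, losses[-1])
```

Each iteration touches every row of `X`, which is what makes this approach expensive on large datasets.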

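# Stochastic Gradient Descent (SGD)

At each step, pick one example at random and take a gradient descent step on it.

Effects:

1. It can be an effective strategy, even for large datasets.
2. It is a poor fit for computer hardware (slow), since examples are processed one at a time.
3. It is not always statistically suitable or correct.

The original text on the third point: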

A second problem is that some of the layers, such as batch normalization (to be described in Section 8.5), only work well when we have access to more than one observation at a time.

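# Minibatch Stochastic Gradient Descent

Choosing the batch size:

- It depends on the available memory, the accelerators, the layers used, and the dataset.
- Suggested: a value between 32 and 256, typically a power of 2 ($2^n$).

Mathematical form of each update step: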

$(\mathbf{w},b) \leftarrow (\mathbf{w},b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \partial_{(\mathbf{w},b)} l^{(i)}(\mathbf{w},b)$

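$\mathcal{B}_t$ is the minibatch of examples sampled at step $t$, $|\mathcal{B}|$ is the batch size (the number of examples in each minibatch), and $\eta$ is the learning rate, i.e. the step size of each update.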
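The update rule above might be sketched as follows. The function name `minibatch_sgd`, its default hyperparameters, and the synthetic data are assumptions for illustration. One further assumption: minibatches are drawn by shuffling the dataset once per epoch and slicing it, a common practical way to sample $\mathcal{B}_t$.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.1, batch_size=32, num_epochs=10, seed=0):
    """Linear regression trained with minibatch SGD on the squared loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(num_epochs):
        order = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            residual = X[batch] @ w + b - y[batch]
            # Average the per-example gradients over the minibatch
            grad_w = X[batch].T @ residual / len(batch)
            grad_b = residual.mean()
            w = w - lr * grad_w
            b = b - lr * grad_b
    return w, b

# Synthetic data with hypothetical ground-truth parameters
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))
true_w, true_b = np.array([2.0, -3.4]), 4.2
y = X @ true_w + true_b + 0.01 * rng.normal(size=1000)

w_hat, b_hat = minibatch_sgd(X, y)
print(w_hat, b_hat)
```

Setting `batch_size=1` recovers plain SGD, and `batch_size=len(X)` recovers the naive full-batch scheme.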

# Other Issues

However, the loss surfaces for deep networks contain many saddle points and minima. --d2l

In short: the loss surface of a deep network may contain saddle points and local minima.

The more formidable task is to find parameters that lead to accurate predictions on previously unseen data, a challenge called generalization --d2l

In short: generalizing to unseen data is harder than fitting the training data.

# Inference | Prediction

As the name suggests: plug new inputs into the trained model to obtain predicted values.

# Speeding Up

Vectorize the computation and call fast linear algebra libraries.
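As a rough illustration, the sketch below computes the same predictions with explicit Python loops and with a single matrix-vector product. The function names and data sizes are assumptions; the timings depend on the machine, so no specific numbers are claimed here.

```python
import time
import numpy as np

def predict_loop(X, w, b):
    """Compute X w + b one scalar multiplication at a time."""
    y_hat = np.empty(X.shape[0])
    for i in range(X.shape[0]):
        total = b
        for j in range(X.shape[1]):
            total += w[j] * X[i, j]
        y_hat[i] = total
    return y_hat

def predict_vectorized(X, w, b):
    """Compute X w + b with one call into the linear algebra backend."""
    return X @ w + b

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 50))
w = rng.normal(size=50)
b = 0.5

start = time.perf_counter()
y_loop = predict_loop(X, w, b)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
y_vec = predict_vectorized(X, w, b)
vec_seconds = time.perf_counter() - start

print(np.allclose(y_loop, y_vec))
print(f"loop: {loop_seconds:.4f}s, vectorized: {vec_seconds:.6f}s")
```

Both functions return the same values; the vectorized version simply hands the whole computation to optimized library code instead of the Python interpreter.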

# Normal Distribution

$p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (x - \mu)^2\right)$

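The normal distribution offers another way to understand the objective of minimizing the squared loss.

(Earlier, we obtained the condition that $\mathbf w$ must satisfy by setting the derivative of the loss function to 0.)

Assume the noise $\epsilon$ follows a normal distribution: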

$y = \mathbf{w}^\top \mathbf{x} + b + \epsilon \textrm{ where } \epsilon \sim \mathcal{N}(0, \sigma^2)$

Given $\mathbf w,b$, the likelihood of observing output $y$ for input $\mathbf x$ in a single example is:

$P(y \mid \mathbf{x}) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (y - \mathbf{w}^\top \mathbf{x} - b)^2\right).$

Assuming the examples are mutually independent, the likelihood of observing the outputs $\mathbf y$ for the whole dataset $\mathbf X$ is:

$P(\mathbf y \mid \mathbf X) = \prod_{i=1}^{n} p(y^{(i)} \mid \mathbf{x}^{(i)}).$

Goal: find the $\mathbf w,b$ that maximize $P(\mathbf y \mid \mathbf X)$

Take the logarithm and negate it; equivalently, find the $\mathbf w,b$ that minimize:

$-\log P(\mathbf y \mid \mathbf X) = \sum_{i=1}^n \frac{1}{2} \log(2 \pi \sigma^2) + \frac{1}{2 \sigma^2} \left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2.$

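The $\log(2 \pi \sigma^2)$ term does not depend on $\mathbf w, b$, and for a fixed $\sigma$ the remaining term is the squared loss up to a positive constant factor. Minimizing this expression therefore has the same objective as minimizing the squared loss.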
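The equivalence can be checked numerically. The sketch below assumes a known, fixed $\sigma$ and synthetic data. The helper names and parameter values are hypothetical. It verifies that the difference in negative log-likelihood between two parameter settings equals $\frac{n}{\sigma^2}$ times the difference in squared loss, so both objectives rank parameters identically.

```python
import numpy as np

def neg_log_likelihood(X, y, w, b, sigma):
    """-log P(y | X) under y = Xw + b + eps, eps ~ N(0, sigma^2)."""
    residual = y - X @ w - b
    n = len(y)
    return (0.5 * n * np.log(2 * np.pi * sigma ** 2)
            + np.sum(residual ** 2) / (2 * sigma ** 2))

def squared_loss(X, y, w, b):
    """Average over examples of 0.5 * (y_hat - y)^2."""
    residual = X @ w + b - y
    return 0.5 * np.mean(residual ** 2)

rng = np.random.default_rng(0)
n, sigma = 100, 0.5
X = rng.normal(size=(n, 2))
true_w, true_b = np.array([2.0, -3.4]), 4.2     # hypothetical ground truth
y = X @ true_w + true_b + sigma * rng.normal(size=n)

# Two candidate parameter settings: the truth and a perturbed version
w_bad, b_bad = true_w + 1.0, true_b
nll_good = neg_log_likelihood(X, y, true_w, true_b, sigma)
nll_bad = neg_log_likelihood(X, y, w_bad, b_bad, sigma)
loss_good = squared_loss(X, y, true_w, true_b)
loss_bad = squared_loss(X, y, w_bad, b_bad)

# The constant log term cancels; what remains is a rescaled squared loss
print(np.isclose(nll_bad - nll_good, n / sigma ** 2 * (loss_bad - loss_good)))
```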

# Relation to Neural Networks

[Figure: single neuron diagram (../_images/singleneuron.svg)]

Linear regression is a single-layer neural network. --d2l

That is, each input $x_1, \ldots, x_d$ connects directly to the single output, with no hidden layers in between.

[end]

2024/1/31

Compiled by mofianger

Reference: "3.1. Linear Regression" in Dive into Deep Learning 1.0.3 documentation (d2l.ai)

Some text and images are quoted from en.d2l.ai