Gradient Descent

Method 1

For a smooth function $f(x)$, at the point $x^k$ we want to find a direction $d^k$ along which the function decreases fastest. To that end, consider

$$\phi(\alpha)=f(x^k+\alpha d^k)$$

A first-order Taylor expansion at $x^k$ gives:

$$\phi(\alpha) = f(x^k)+\alpha \left(\nabla f(x^k)\right)^\top d^k + \mathcal O(\alpha^2\|d^k\|^2)$$

By the Cauchy–Schwarz inequality, when $\alpha$ is sufficiently small, taking $d^k=-\nabla f(x^k)$ makes the function decrease fastest:

$$\left(\nabla f(x^k)\right)^\top d^k \geq -\|\nabla f(x^k)\| \cdot \|d^k\|,$$

with equality when $d^k=-\nabla f(x^k)$.

We restrict $\alpha$ to be sufficiently small so that the $\mathcal{O}(\alpha^2\|d^k\|^2)$ term is negligible; otherwise the solution would not necessarily be $d^k=-\nabla f(x^k)$.

Method 2

For a smooth function $f(x)$, at the point $x^k$ we want to minimize the first-order approximation of $f$ over a neighborhood of $x^k$. Concretely, we solve the following problem:

$$\begin{align*} x^{k+1}= \operatorname*{argmin}_{x} &~f(x^k)+\left(\nabla f(x^k)\right)^\top(x-x^k)\\ {\rm s.t.}~& \|x-x^k\|\leq \alpha \|\nabla f(x^k)\| \end{align*}$$

The constraint is only an assumption: we posit that the size of the neighborhood is controlled by $\alpha \|\nabla f(x^k)\|$.


$$\begin{align*} \left(\nabla f(x^k)\right)^\top(x-x^k) &= \cos\left(\nabla f(x^k),x-x^k\right)\left\|\nabla f(x^k)\right\|\|x-x^k\|\\ &\geq-\left\|\nabla f(x^k)\right\|\|x-x^k\|\\ &\geq-\alpha\left\|\nabla f(x^k)\right\|^2 \end{align*}$$

Choosing $x-x^k = -\alpha\nabla f(x^k)$ attains both equalities.

Therefore,

$$x^{k+1} = x^k - \alpha \nabla f(x^k)$$
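Both derivations lead to the same iteration. A minimal sketch in code (the quadratic test function below is an illustrative assumption, not from the notes):

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, iters=1000):
    """Plain gradient descent: x^{k+1} = x^k - alpha * grad(x^k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - alpha * grad(x)
    return x

# Illustrative objective: f(x) = (x1 - 1)^2 + 2*(x2 + 3)^2, minimizer (1, -3)
grad_f = lambda x: np.array([2 * (x[0] - 1), 4 * (x[1] + 3)])
x_star = gradient_descent(grad_f, [0.0, 0.0])
```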

Step Size Selection

Fixed Step Size

Assume $\nabla f$ is $L$-Lipschitz continuous, i.e.,

$$\|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|$$

It can be shown that

$$\|\nabla^2 f(x)\|\leq L$$

By the second-order Taylor expansion (with $\xi$ some point on the segment between $x^k$ and $x^{k+1}$), we have

$$\begin{align*} f(x^{k+1})&=f(x^k)+\left\langle\nabla f(x^k),x^{k+1}-x^k\right\rangle+\frac 12\left\langle x^{k+1}-x^k,\nabla^2f(\xi)(x^{k+1}-x^k)\right\rangle\\ &\leq f(x^k)+\left\langle\nabla f(x^k),x^{k+1}-x^k\right\rangle+\frac L2\left\|x^{k+1}-x^k\right\|^2\\ &=f(x^k)-\alpha\left\|\nabla f(x^k) \right\|^2+\frac{\alpha^2 L}2\left\|\nabla f(x^k) \right\|^2 \end{align*}$$

Since we want $f(x^{k+1})<f(x^k)$, we solve

$$-\alpha\left\|\nabla f(x^k) \right\|^2+\frac{\alpha^2 L}2\left\|\nabla f(x^k) \right\|^2< 0$$

which gives $\alpha \in \left(0,\frac 2L\right)$.
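A quick numerical check of this bound on the toy function $f(x)=x^2/2$, so that $L=1$ and the admissible range is $(0,2)$ (this example is an illustration, not from the notes):

```python
# Gradient descent on f(x) = x^2/2, where grad f(x) = x and L = 1.
# The iterate satisfies x <- (1 - alpha) * x, which contracts iff alpha in (0, 2).
def run(alpha, x0=1.0, iters=100):
    x = x0
    for _ in range(iters):
        x = x - alpha * x  # gradient step
    return abs(x)

inside = run(0.5)    # alpha < 2/L: converges to 0
outside = run(2.5)   # alpha > 2/L: diverges
```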

Exact Line Search

$$\alpha_k = \operatorname*{argmin}_{\alpha>0} f(x^{k}-\alpha\nabla f(x^k))$$

Backtracking Line Search

In inexact line search, the chosen $\alpha_k$ must satisfy certain requirements, known as line search criteria. The backtracking line search algorithm works as follows.


Backtracking line search starts from an initial step size and checks whether a line search criterion is satisfied. If it is, the step size we found is good enough and is accepted; otherwise $\alpha$ is shrunk by a factor of $\gamma$. Here the Armijo condition is used:

$$f(x^k+\alpha d^k)\leqslant f(x^k)+c_1\alpha\nabla f(x^k)^\mathrm{T}d^k$$
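A sketch of the procedure in code (the defaults $\gamma=0.5$ and $c_1=10^{-4}$ are common choices, assumed here rather than taken from the algorithm figure):

```python
import numpy as np

def backtracking(f, grad_f, x, d, alpha0=1.0, gamma=0.5, c1=1e-4):
    """Shrink alpha by gamma until the Armijo condition holds:
    f(x + alpha*d) <= f(x) + c1 * alpha * grad_f(x).T @ d."""
    alpha = alpha0
    fx, gTd = f(x), grad_f(x) @ d
    while f(x + alpha * d) > fx + c1 * alpha * gTd:
        alpha *= gamma
    return alpha

# Example on f(x) = ||x||^2 with the descent direction d = -grad f(x)
f = lambda x: x @ x
grad_f = lambda x: 2 * x
x = np.array([3.0, -4.0])
alpha = backtracking(f, grad_f, x, -grad_f(x))
```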


Acceleration Methods

Nesterov's Method

Nesterov's algorithm consists of two steps: first, compute a new point along the direction determined by the previous two iterates; second, perform one proximal gradient iteration at that new point.

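In symbols, a standard form of the two steps is (reconstructed as a sketch, not verbatim from the source; $\alpha$ is the step size):

$$\begin{align*} y^{k} &= x^{k-1} + \frac{k-2}{k+1}\left(x^{k-1}-x^{k-2}\right)\\ x^{k} &= y^{k} - \alpha\nabla f\left(y^{k}\right) \end{align*}$$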

The coefficient $\frac{k-2}{k+1}$ can also be replaced by other choices; for example, one may take $w_{k-1}$ with

$$\lambda_k = \frac{1}{2}\left(1 + \sqrt{1+4\lambda_{k-1}^2}\right),\quad w_{k-1}=\frac{\lambda_{k-1}-1}{\lambda_k}$$

Momentum Method

$$\begin{align*} v^{k+1} &= \beta_k v^k + \nabla f(x^k)\\ x^{k+1}&=x^k - \alpha v^{k+1} \end{align*}$$
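A minimal sketch of this update (the quadratic test function and the values of `alpha` and `beta` below are illustrative assumptions):

```python
import numpy as np

def momentum_gd(grad, x0, alpha=0.1, beta=0.9, iters=500):
    """Momentum method: v <- beta*v + grad(x); x <- x - alpha*v."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(iters):
        v = beta * v + grad(x)
        x = x - alpha * v
    return x

# Illustrative objective: f(x) = (x1 - 1)^2 + 2*(x2 + 3)^2, minimizer (1, -3)
grad_f = lambda x: np.array([2 * (x[0] - 1), 4 * (x[1] + 3)])
x_star = momentum_gd(grad_f, [0.0, 0.0])
```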

Conjugate Gradient Method

Building on the momentum method, choose

$$\beta_k = \frac{\left\langle \nabla f(x^k),\nabla f(x^k)-\nabla f(x^{k-1})\right\rangle}{\left\langle v^{k},\nabla f(x^k)-\nabla f(x^{k-1})\right\rangle}$$

Applications

Least Squares

$$\min_{x\in\mathbb R^n}\frac12\|Ax-b\|^2$$

GD + Exact Line Search

$$x^{k+1} = x^k -\alpha_kA^\top(Ax^{k}-b)$$

$$\begin{align*} \alpha_k &=\operatorname*{argmin}_{\alpha>0} ~g(\alpha)\\ &= \operatorname*{argmin}_{\alpha>0}~ \frac12\|A(x^{k}-\alpha A^\top(Ax^{k}-b))-b\|^2\\ &= \operatorname*{argmin}_{\alpha>0}~ \frac12\|Ax^{k}-b\|^2-\alpha\langle Ax^{k}-b,AA^\top(Ax^{k}-b)\rangle+\frac{\alpha^2}2\|AA^\top(Ax^{k}-b)\|^2 \end{align*}$$

Assume the columns of $A$ are linearly independent. Since $A^\top (Ax^k-b)\neq 0$ ($x^k$ is not yet optimal), we have

$$g^{\prime\prime}(\alpha) =\|AA^\top(Ax^{k}-b)\|^2>0$$

By the first-order optimality condition,

$$\begin{gather*} g^\prime(\alpha) = \alpha\|AA^\top(Ax^{k}-b)\|^2-\langle Ax^{k}-b,AA^\top(Ax^{k}-b)\rangle=0\\ \Leftrightarrow \alpha = \frac{\langle Ax^{k}-b,AA^\top(Ax^{k}-b)\rangle}{\|AA^\top(Ax^{k}-b)\|^2} =\frac{\|A^\top(Ax^{k}-b)\|^2}{\|AA^\top(Ax^{k}-b)\|^2} \end{gather*}$$

Steepest Descent Algorithm

$$\begin{align*} &\text{for}~ k=0,1,2,\cdots\\ &\quad\quad g^k = A^\top (Ax^k-b)\\ &\quad\quad \alpha_k = \frac{\|g^k\|^2}{\|Ag^k\|^2}\\ &\quad\quad x^{k+1} = x^k-\alpha_kg^k\\ &\text{end} \end{align*}$$
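A direct translation of this algorithm into code (the small test problem is an illustrative assumption; `np.linalg.lstsq` provides the reference solution):

```python
import numpy as np

def steepest_descent_ls(A, b, x0, iters=200):
    """Steepest descent for min 0.5*||Ax - b||^2 with exact line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = A.T @ (A @ x - b)          # gradient of the least-squares objective
        if g @ g < 1e-30:              # already converged
            break
        Ag = A @ g
        alpha = (g @ g) / (Ag @ Ag)    # alpha_k = ||g^k||^2 / ||A g^k||^2
        x = x - alpha * g
    return x

A = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])
x = steepest_descent_ls(A, b, np.zeros(2))
```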

Each loop requires two matrix–vector products with $A$ and one with $A^\top$, but this can be reduced to one product with $A$ and one with $A^\top$. Specifically, store $g^{k-1}$ and $Ag^{k-1}$, and initialize with one plain steepest-descent step:

$$\begin{align*} &g^0=A^\top(Ax^0-b),\quad \alpha_0 = \frac{\|g^0\|^2}{\|Ag^0\|^2},\quad x^1 = x^0-\alpha_0g^0\\ &\text{for}~ k=1,2,3,\cdots\\ &\quad\quad g^k = g^{k-1}-\alpha_{k-1}A^\top(Ag^{k-1})\quad \text{one product with }A^\top\\ &\quad\quad \alpha_k = \frac{\|g^k\|^2}{\|Ag^k\|^2}\quad\quad\quad\quad\quad\quad~ \text{one product with }A\\ &\quad\quad x^{k+1} = x^k-\alpha_kg^k\\ &\text{end} \end{align*}$$

Writing $Ag^{k-1}$ as $u^{k-1}$:

$$\begin{align*} &g^0=A^\top(Ax^0-b),\quad u^0 = A g^0,\quad \alpha_0=\frac{\|g^0\|^2}{\|u^0\|^2},\quad x^1 = x^0-\alpha_0 g^0\\ &\text{for}~ k=1,2,3,\cdots\\ &\quad\quad g^k = g^{k-1}-\alpha_{k-1}A^\top u^{k-1}\quad \text{one product with }A^\top\\ &\quad\quad u^k = Ag^k\quad\quad\quad\quad\quad\quad~~~ \text{one product with }A\\ &\quad\quad \alpha_k = \frac{\|g^k\|^2}{\|u^k\|^2}\\ &\quad\quad x^{k+1} = x^k-\alpha_kg^k\\ &\text{end} \end{align*}$$
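The reduced-multiplication variant in code, using the recurrence $g^k = g^{k-1}-\alpha_{k-1}A^\top u^{k-1}$ (the test problem is an illustrative assumption):

```python
import numpy as np

def steepest_descent_ls_fast(A, b, x0, iters=200):
    """Steepest descent for min 0.5*||Ax - b||^2, with only one product
    by A and one by A.T per loop iteration."""
    x = np.asarray(x0, dtype=float)
    g = A.T @ (A @ x - b)
    u = A @ g
    for _ in range(iters):
        if g @ g < 1e-30:              # already converged
            break
        alpha = (g @ g) / (u @ u)      # alpha_k = ||g^k||^2 / ||u^k||^2
        x = x - alpha * g
        g = g - alpha * (A.T @ u)      # recurrence replaces A.T @ (A @ x - b)
        u = A @ g
    return x

A = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])
x = steepest_descent_ls_fast(A, b, np.zeros(2))
```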

Conjugate Gradient Method

$$\begin{align*} g^k &= A^\top (Ax^k-b)\\ p^k &=g^k+\beta_kp^{k-1}\\ x^{k+1}&=x^{k}-\alpha_kp^k \end{align*}$$

We need to choose $\alpha_k,\beta_k$ so that $f(x^{k+1})$ is minimized. First, since $\alpha_k$ is optimal,

$$\begin{align*} \alpha_k &= \operatorname*{argmin}_{\alpha} g(\alpha)\\ &=\operatorname*{argmin}_{\alpha} \frac12\|A(x^k-\alpha p^k)-b\|^2\\ &=\operatorname*{argmin}_{\alpha} \frac12\|Ax^k-b\|^2-\alpha\langle Ax^k-b,A p^k \rangle+\frac{\alpha^2}2\|A p^k\|^2 \end{align*}$$

By the first-order optimality condition,

$$\begin{gather*} g^\prime(\alpha)=\alpha\|Ap^k\|^2-\langle Ax^k-b,A p^k \rangle=0\\ \Leftrightarrow \alpha = \frac{\langle Ax^k-b,A p^k \rangle}{\|Ap^k\|^2}=\frac{\langle A^\top (Ax^k-b),p^k \rangle}{\|Ap^k\|^2}=\frac{\langle g^k,p^k \rangle}{\|Ap^k\|^2} \end{gather*}$$

αk\alpha_k的这个最优形式可以得到:

gk+1,pk=A(A(xkαkpk)b),pk=AA(xkb)αkAApk,pk=gkαkAApk,pk=gk,pkαkAApk,pk=gk,pkgk,pkApk2Apk,Apk=0\begin{align*} \langle g^{k+1},p^k\rangle &= \langle A^\top (A(x^k-\alpha_k p_k)-b),p^k\rangle\\ &= \langle A^\top A(x^k-b)-\alpha_kA^\top A p_k,p^k\rangle\\ &= \langle g^k-\alpha_kA^\top A p_k,p^k\rangle\\ &=\langle g^k,p^k\rangle-\alpha_k\langle A^\top A p_k,p^k\rangle\\ &=\langle g^k,p^k\rangle-\frac{\langle g^k,p^k \rangle}{\|Ap^k\|^2}\langle A p_k,Ap^k\rangle =0 \end{align*}

Next, solve for $\beta_k$:

$$(\alpha_k,\beta_k) = \operatorname*{argmin}_{\alpha,\beta} E(\alpha,\beta) =\operatorname*{argmin}_{\alpha,\beta}f\big(x^k-\alpha(g^k+\beta p^{k-1})\big)$$

Denote $\theta=\alpha \beta$. Then

$$\begin{align*} E(\alpha,\theta)&=\frac12\Big\|A(x^k-\alpha g^k-\theta p^{k-1})-b\Big\|^2\\ &=\frac12\Big\|(b-Ax^k)+\alpha A g^k+\theta A p^{k-1}\Big\|^2\\ &=\frac12\Big(\|b-Ax^k\|^2+\alpha^2\|Ag^k\|^2+\theta^2\|Ap^{k-1}\|^2\\&\quad+2\alpha\langle b-Ax^k,Ag^k \rangle+2\theta\langle b-Ax^k,Ap^{k-1} \rangle +2\alpha\theta\langle Ag^k,Ap^{k-1} \rangle\Big) \end{align*}$$

$$\begin{align*} \frac{\partial E}{\partial \theta}&=\theta\|Ap^{k-1}\|^2+\langle b-Ax^k,Ap^{k-1} \rangle+\alpha\langle Ag^k,Ap^{k-1} \rangle\\ &=\theta\|Ap^{k-1}\|^2-\langle A^\top(Ax^k-b),p^{k-1} \rangle+\alpha\langle Ag^k,Ap^{k-1} \rangle\\ &=\theta\|Ap^{k-1}\|^2-\langle g^k,p^{k-1} \rangle+\alpha\langle Ag^k,Ap^{k-1} \rangle\\ &=\theta\|Ap^{k-1}\|^2+\alpha\langle Ag^k,Ap^{k-1} \rangle=0 \end{align*}$$

where the last step uses $\langle g^{k},p^{k-1}\rangle=0$, shown above with $k$ shifted by one. Hence

$$\theta_k=-\alpha\frac{\langle Ag^k,Ap^{k-1} \rangle}{\|Ap^{k-1}\|^2} \Leftrightarrow \beta_k=-\frac{\langle Ag^k,Ap^{k-1} \rangle}{\|Ap^{k-1}\|^2}$$

Conjugate Gradient for Least Squares

$$\begin{align*} &p^{-1}=0,\quad \beta_0=0\\ &\text{for}~ k=0,1,2,\cdots\\ &\quad\quad g^k = A^\top (Ax^k-b)\\ &\quad\quad \beta_k=-\frac{\langle Ag^k,Ap^{k-1} \rangle}{\|Ap^{k-1}\|^2}\quad (k\geq 1)\\ &\quad\quad p^k =g^k+\beta_kp^{k-1}\\ &\quad\quad \alpha_k = \frac{\langle g^k,p^k\rangle}{\|Ap^k\|^2}\\ &\quad\quad x^{k+1} = x^k-\alpha_kp^k\\ &\text{end} \end{align*}$$
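The scheme in code, taking $\beta_0=0$ since $p^{-1}=0$ (the test problem is an illustrative assumption). In exact arithmetic, conjugate gradient reaches the least-squares solution in at most $n$ iterations:

```python
import numpy as np

def cg_least_squares(A, b, x0):
    """Conjugate gradient for min 0.5*||Ax - b||^2, as derived above."""
    x = np.asarray(x0, dtype=float)
    p = np.zeros_like(x)
    Ap = np.zeros_like(b)
    for k in range(x.size):            # at most n steps in exact arithmetic
        g = A.T @ (A @ x - b)
        Ag = A @ g
        beta = 0.0 if k == 0 else -(Ag @ Ap) / (Ap @ Ap)
        p = g + beta * p
        Ap = Ag + beta * Ap            # A @ p^k without an extra product
        alpha = (g @ p) / (Ap @ Ap)
        x = x - alpha * p
    return x

A = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])
x = cg_least_squares(A, b, np.zeros(2))
```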

Linear Systems

Solve the linear system

$$Ax=b,\quad A\in\mathbb R^{n\times n} ~\text{is SPD}$$

which is equivalent to

$$\min_{x\in\mathbb R^n} ~\frac12 x^\top Ax-x^\top b$$

Steepest Descent (GD + Exact Line Search)

$$x^{k+1} = x^{k} - \alpha_k(Ax^{k} - b)$$

where $\alpha_{k}=\operatorname*{argmin}_{\alpha>0}g(\alpha)$ with $g(\alpha)=f(x^{k}-\alpha(Ax^{k}-b))$. Expanding,

$$\begin{align*} g(\alpha)&=\frac12\big(x^{k}-\alpha(Ax^{k}-b)\big)^\top A\big(x^{k}-\alpha(Ax^{k}-b)\big)-\big(x^{k}-\alpha(Ax^{k}-b)\big)^\top b\\ &=\frac 12 (x^k)^\top Ax^k-(x^k)^\top b-\alpha(x^k)^\top A(Ax^{k}-b)\\&\quad+\frac{\alpha^2}2(Ax^k-b)^\top A(Ax^k-b)+\alpha(Ax^k-b)^\top b \end{align*}$$

$$\begin{align*} g^\prime(\alpha)&=-(x^k)^\top A(Ax^k -b)+\alpha(Ax^k-b)^\top A(Ax^k-b)+(Ax^k-b)^\top b\\ &=-((x^k)^\top A -b^\top)(Ax^k -b) +\alpha(Ax^k-b)^\top A(Ax^k-b)=0 \end{align*}$$

Since $A=A^\top$, we have $\big((x^k)^\top A -b^\top\big)(Ax^k -b)=\|Ax^k-b\|^2$, so

$$\alpha_k =\frac{\|Ax^k-b\|^2}{\langle Ax^k-b,A(Ax^k-b)\rangle}$$

Steepest Descent for Linear Equation

$$\begin{align*} &\text{for}~ k=0,1,2,\cdots\\ &\quad\quad g^k = Ax^k-b\\ &\quad\quad \alpha_k = \frac{\|g^k\|^2}{\langle g^k,Ag^k\rangle}\\ &\quad\quad x^{k+1} = x^k-\alpha_kg^k\\ &\text{end} \end{align*}$$
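In code (the SPD test matrix is an illustrative assumption):

```python
import numpy as np

def steepest_descent_spd(A, b, x0, iters=100):
    """Steepest descent for Ax = b with A SPD (minimizes 0.5 x'Ax - x'b)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = A @ x - b                     # gradient of the quadratic
        if g @ g < 1e-30:                 # already converged
            break
        alpha = (g @ g) / (g @ (A @ g))   # exact line-search step
        x = x - alpha * g
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])    # symmetric positive definite
b = np.array([1.0, 2.0])
x = steepest_descent_spd(A, b, np.zeros(2))
```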

Conjugate Gradient Method

$$\begin{align*} g^k &= Ax^k-b\\ p^k &=g^k+\beta_kp^{k-1}\\ x^{k+1}&=x^{k}-\alpha_kp^k \end{align*}$$

We need to choose $\alpha_k,\beta_k$ so that $f(x^{k+1})$ is minimized. Since $\alpha_k$ is optimal, analogously to the least-squares case we have:

$$\alpha_{k}=\operatorname*{argmin}_{\alpha>0}f(x^{k}-\alpha p^{k})=\frac{\langle p^{k},g^{k}\rangle}{\langle p^{k},Ag^{k}\rangle}$$

as well as $\langle g^{k},p^{k-1}\rangle=0$.

Next, solve for $\beta_k$:

$$(\alpha_k,\beta_k) = \operatorname*{argmin}_{\alpha,\beta} E(\alpha,\beta) =\operatorname*{argmin}_{\alpha,\beta}f\big(x^k-\alpha(g^k+\beta p^{k-1})\big)$$

Expanding and substituting $\langle g^{k},p^{k-1}\rangle=0$ gives:

$$\beta_k = -\frac{\langle g^{k},Ap^{k-1}\rangle}{\langle p^{k-1},Ap^{k-1}\rangle}$$

$$\begin{align*} &p^{-1}=0,\quad \beta_0=0\\ &\text{for}~ k=0,1,2,\cdots\\ &\quad\quad g^k = Ax^k-b\\ &\quad\quad \beta_k=-\frac{\langle g^{k},Ap^{k-1}\rangle}{\langle p^{k-1},Ap^{k-1}\rangle}\quad (k\geq 1)\\ &\quad\quad p^k =g^k+\beta_kp^{k-1}\\ &\quad\quad \alpha_k = \frac{\langle p^{k},g^{k}\rangle}{\langle p^{k},Ag^{k}\rangle}\\ &\quad\quad x^{k+1} = x^k-\alpha_kp^k\\ &\text{end} \end{align*}$$
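The corresponding code sketch, taking $\beta_0=0$ since $p^{-1}=0$ (the SPD test matrix is an illustrative assumption). Conjugate gradient solves an SPD system in at most $n$ steps in exact arithmetic:

```python
import numpy as np

def cg_spd(A, b, x0):
    """Conjugate gradient for Ax = b with A SPD, as derived above."""
    x = np.asarray(x0, dtype=float)
    p = np.zeros_like(x)
    Ap = np.zeros_like(x)
    for k in range(x.size):            # at most n steps in exact arithmetic
        g = A @ x - b
        Ag = A @ g
        beta = 0.0 if k == 0 else -(g @ Ap) / (p @ Ap)
        p = g + beta * p
        Ap = Ag + beta * Ap            # A @ p^k without an extra product
        alpha = (p @ g) / (p @ Ag)
        x = x - alpha * p
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])    # symmetric positive definite
b = np.array([1.0, 2.0])
x = cg_spd(A, b, np.zeros(2))
```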