# Online Learning: Algorithms in Theory and Practice

## What Is Online Learning

Online Learning somewhat resembles an automatic control system, but the two are not the same. The difference is the objective: Online Learning minimizes the total loss function over the whole stream of data, while a control system minimizes the deviation of the final output from a target value.

## How to Implement Online Learning

#### Bayesian Online Learning

Consider a Bernoulli observation \(Y \in \{0, 1\}\) with success probability \(\mu\), and place a conjugate Beta prior on \(\mu\):

$$ p\left(\mu\right) = \operatorname{Beta}(\alpha, \beta) $$

The posterior after a single observation stays in the Beta family:

$$ p\left(\mu \mid Y=1\right) = \operatorname{Beta}(\alpha+1, \beta) $$

$$ p\left(\mu \mid Y=0\right) = \operatorname{Beta}(\alpha, \beta+1) $$

So the online update is a pair of counters: on \(Y=1\), set \(\alpha = \alpha + 1\); on \(Y=0\), set \(\beta = \beta + 1\).

After \(H\) successes in \(N\) trials, the posterior mean of \(\mu\) is

$$ \mu = \frac{\alpha + H}{\alpha + \beta + N} $$

The same posterior can be obtained in batch form from the log joint over \(H\) successes and \(T = N - H\) failures; the online updates recover it one observation at a time:

$$ \log\left[ p\left(\mu \mid \alpha, \beta\right) \cdot p\left(Y = 1 \mid \mu\right)^{H} \cdot p\left(Y = 0 \mid \mu\right)^{T} \right] $$
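The counter updates above are easy to sketch in Python; the prior values and the simulated Bernoulli stream below are purely illustrative:

```python
import random

def beta_update(alpha, beta, y):
    """One Bayesian online step: y=1 increments alpha, y=0 increments beta."""
    return (alpha + 1, beta) if y == 1 else (alpha, beta + 1)

# Uniform prior Beta(1, 1), then stream simulated Bernoulli(0.3) observations.
alpha, beta = 1.0, 1.0
random.seed(0)
for _ in range(1000):
    y = 1 if random.random() < 0.3 else 0
    alpha, beta = beta_update(alpha, beta, y)

# Posterior mean, i.e. (alpha + H) / (alpha + beta + N) relative to the prior.
posterior_mean = alpha / (alpha + beta)
```

Each step only needs the current \((\alpha, \beta)\), which is exactly what makes the estimator online.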

For a real-valued observation, take a Gaussian likelihood with known variance \(\sigma^2\) and a Gaussian prior on the mean:

$$ p\left(Y \mid \mu\right) = N\left(Y \mid \mu, \sigma^2\right) $$

$$ p\left(\mu\right) = N\left(\mu \mid m, v^2\right) $$

The posterior after observing \(Y_i\) is again Gaussian:

$$ p\left(\mu \mid Y_i\right) = N\left(\mu \mid \frac{Y_i v^{2} + m\sigma^{2}}{\sigma^{2} + v^{2}}, \frac{\sigma^{2} v^{2}}{\sigma^{2} + v^{2}}\right) $$

so the online update is

$$ m = \frac{Y_i v^{2} + m\sigma^{2}}{\sigma^{2} + v^{2}} $$

$$ v^{2} = \frac{\sigma^{2} v^{2}}{\sigma^{2} + v^{2}} $$
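A minimal sketch of the Gaussian mean update, assuming the observation noise \(\sigma^2\) is known; the prior and the data below are illustrative:

```python
def gaussian_update(m, v2, y, sigma2):
    """Conjugate online update of N(mu | m, v2) after observing y ~ N(mu, sigma2)."""
    m_new = (y * v2 + m * sigma2) / (sigma2 + v2)
    v2_new = (sigma2 * v2) / (sigma2 + v2)
    return m_new, v2_new

m, v2 = 0.0, 100.0   # broad prior on mu
sigma2 = 1.0         # known observation noise
for y in [2.1, 1.9, 2.0, 2.2, 1.8]:
    m, v2 = gaussian_update(m, v2, y, sigma2)
# m has moved close to the sample mean 2.0, and v2 has shrunk.
```

Because the updates are conjugate, feeding the observations one at a time yields exactly the same posterior as a single batch update.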

When the observation is only a sign, \(Y_i \in \{-1, +1\}\), the exact posterior is a truncated Gaussian:

$$ p(\mu \mid Y_i = 1) = \frac{I(\mu > 0)\, p(\mu)}{\int_{0}^{+\infty} p(\mu)\,\mathrm{d}\mu} $$

$$ p(\mu \mid Y_i = -1) = \frac{I(\mu < 0)\, p(\mu)}{\int_{-\infty}^{0} p(\mu)\,\mathrm{d}\mu} $$

A truncated Gaussian is no longer Gaussian, so we replace it with the Gaussian \(N(\mu \mid \tilde m, \tilde v^{2})\) that minimizes the KL divergence to it (moment matching). For \(Y_i = 1\):

$$ KL\left( p(\mu \mid Y_i = 1) \,\|\, N(\mu \mid \tilde m, \tilde v^{2}) \right) $$

$$ \tilde m = m + v \cdot \upsilon\left(\frac{m}{v}\right) $$

$$ \tilde v^{2} = v^2\left(1 - \omega\left(\frac{m}{v}\right)\right) $$

For \(Y_i = -1\):

$$ KL\left( p(\mu \mid Y_i = -1) \,\|\, N(\mu \mid \tilde m, \tilde v^{2}) \right) $$

$$ \tilde m = m - v \cdot \upsilon\left(-\frac{m}{v}\right) $$

$$ \tilde v^{2} = v^2\left(1 - \omega\left(-\frac{m}{v}\right)\right) $$

The two cases combine into a single update:

$$ \tilde m = m + Y_{i} v \cdot \upsilon\left(Y_{i}\frac{m}{v}\right) $$

$$ \tilde v^{2} = v^2\left(1 - \omega\left(Y_{i}\frac{m}{v}\right)\right) $$

Here \(\upsilon\) and \(\omega\) are defined in terms of the standard normal pdf \(\phi\) and cdf \(\Phi\):

$$ \upsilon(t) = \frac{\phi(t)}{\Phi(t)} $$

$$ \phi(t) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} t^2\right) $$

$$ \Phi(t) = \int_{-\infty}^{t} \phi(s)\,\mathrm{d}s $$

$$ \omega(t) = \upsilon(t)\left(\upsilon(t) + t\right) $$

The online update for a sign observation is therefore:

$$ m = m + Y_{i} \cdot v \cdot \upsilon\left(Y_{i} \cdot \frac{m}{v}\right) $$

$$ v^{2} = v^2\left(1 - \omega\left(Y_{i} \cdot \frac{m}{v}\right)\right) $$
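Using only the standard library (`math.erf` gives the normal cdf), the sign-observation update can be sketched as:

```python
import math

def phi(t):
    """Standard normal pdf."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def Phi(t):
    """Standard normal cdf, via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def v_fn(t):
    return phi(t) / Phi(t)

def w_fn(t):
    return v_fn(t) * (v_fn(t) + t)

def probit_update(m, v2, y):
    """Moment-matched update of N(mu | m, v2) after a sign observation y in {-1, +1}."""
    v = math.sqrt(v2)
    t = y * m / v
    return m + y * v * v_fn(t), v2 * (1.0 - w_fn(t))

m, v2 = probit_update(0.0, 1.0, +1)   # observing y=+1 pushes the mean up
```

Note that the variance always shrinks (since \(0 < \omega(t) < 1\)): every sign observation makes the posterior more confident.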

The best-known application of Bayesian Online Learning is BPR (Bayesian Probit Regression).

#### BPR

Suppose \(x\) follows a multivariate Gaussian distribution:

$$ p\left(x\right) = N\left(x \mid \mu_x, \Sigma_x\right) $$

and \(y\) is obtained from \(x\) by a linear transform plus Gaussian noise with covariance \(\Sigma_y\):

$$ p\left(y \mid x\right) = N\left(y \mid Ax + b, \Sigma_y\right) $$

Marginalizing out \(x\) gives:

$$ p\left(y\right) = N\left(y \mid A\mu_x + b, \Sigma_y + A\Sigma_x A^{T}\right) $$

BPR maintains a Gaussian posterior over the weight vector \(w\), with independent coordinates:

$$ \mu = \left[\mu_1, \mu_2, \ldots, \mu_D\right]^{\mathrm{T}} $$

$$
\Sigma = \left[\begin{matrix}
\sigma_1^{2} & 0 & \ldots & 0 \\
0 & \sigma_2^{2} & \ldots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \ldots & \sigma_D^{2}
\end{matrix}\right]
$$

The score \(y\) is a scalar: the inner product of \(w\) with the feature vector \(x\), plus noise with variance \(\beta^{2}\):

$$ p\left(y \mid w\right) = N(y \mid x^T w, \beta^2) $$

Marginalizing over the Gaussian prior on \(w\) gives the predictive distribution:

$$ p\left(y\right) = N(y \mid x^T\mu, x^T\Sigma x + \beta^2) $$

Applying the moment-matching update from the previous section to a sign observation \(Y_i\):

$$ p\left(y \mid Y_i\right) = N\left(y \mid \tilde m, \tilde v^2\right) $$

$$ \tilde m = x^T\mu + Y_{i}\sqrt{x^T\Sigma x + \beta^2} \cdot \upsilon\left(Y_{i} \cdot \frac{x^T\mu}{\sqrt{x^T\Sigma x + \beta^2}}\right) $$

$$ \tilde v^2 = \left(x^T\Sigma x + \beta^2\right)\left(1 - \omega\left(Y_{i} \cdot \frac{x^T\mu}{\sqrt{x^T\Sigma x + \beta^2}}\right)\right) $$

The posterior over the weights follows from Bayes' rule:

$$ p\left(w \mid y\right) \propto p\left(y \mid w\right) p\left(w\right) $$

and, coordinate by coordinate, moment matching gives

$$ p\left(w_{d} \mid y\right) = N\left(w_{d} \mid \tilde\mu_{d}, \tilde\sigma_{d}^{2}\right) $$

$$ \tilde\mu_{d} = \mu_{d} + Y_{i} x_{i,d} \cdot \frac{\sigma_{d}^{2}}{\sqrt{x^T\Sigma x + \beta^2}} \cdot \upsilon\left(Y_{i} \cdot \frac{x^T\mu}{\sqrt{x^T\Sigma x + \beta^2}}\right) $$

$$ \tilde\sigma_{d}^{2} = \sigma_{d}^{2} \cdot \left[1 - x_{i,d} \cdot \frac{\sigma_{d}^{2}}{x^T\Sigma x + \beta^2}\, \omega\left(Y_{i} \cdot \frac{x^T\mu}{\sqrt{x^T\Sigma x + \beta^2}}\right)\right] $$

The Online Bayesian Probit Regression training procedure is as follows. For each incoming sample \((x_i, Y_i)\), for \(d = 1 \ldots D\):

$$ \mu_{d} = \mu_{d} + Y_{i} x_{i,d} \cdot \frac{\sigma_{d}^{2}}{\sqrt{x_{i}^T\Sigma x_{i} + \beta^2}} \cdot \upsilon\left(Y_{i} \cdot \frac{x_{i}^T\mu}{\sqrt{x_{i}^T\Sigma x_{i} + \beta^2}}\right) $$

$$ \sigma_{d}^{2} = \sigma_{d}^{2} \cdot \left[1 - x_{i,d} \cdot \frac{\sigma_{d}^{2}}{x_{i}^T\Sigma x_{i} + \beta^2}\, \omega\left(Y_{i} \cdot \frac{x_{i}^T\mu}{\sqrt{x_{i}^T\Sigma x_{i} + \beta^2}}\right)\right] $$
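This loop can be sketched in Python with a diagonal Gaussian posterior over the weights. Following the AdPredictor setting, features are assumed binary (so \(x_{i,d}^2 = x_{i,d}\)); `v_fn`/`w_fn` are the \(\upsilon\)/\(\omega\) functions from above, and all concrete numbers are illustrative:

```python
import math

def phi(t): return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
def Phi(t): return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
def v_fn(t): return phi(t) / Phi(t)
def w_fn(t): return v_fn(t) * (v_fn(t) + t)

def bpr_update(mu, sigma2, x, y, beta2=1.0):
    """One online BPR step.

    mu, sigma2 -- per-weight posterior means/variances (lists of length D)
    x          -- binary feature vector; y -- label in {-1, +1}
    """
    s2 = sum(xd * xd * s for xd, s in zip(x, sigma2)) + beta2  # x^T Sigma x + beta^2
    s = math.sqrt(s2)
    t = y * sum(xd * m for xd, m in zip(x, mu)) / s
    for d, xd in enumerate(x):
        mu[d] += y * xd * (sigma2[d] / s) * v_fn(t)
        sigma2[d] *= 1.0 - xd * (sigma2[d] / s2) * w_fn(t)
    return mu, sigma2

# One positive observation on a 3-dimensional example with two active features.
mu, sigma2 = bpr_update([0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [1, 0, 1], +1)
```

Only the active features (\(x_{i,d} \neq 0\)) are touched, so each update costs time proportional to the number of non-zero features, not to \(D\).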

#### FTRL

FTL (Follow The Leader) picks the weights that minimize the cumulative loss over all samples seen so far:

$$ w = \mathop{\arg\min}_{w} \sum_{i=1}^{t} f_i\left(w\right) $$

FTRL adds a regularization term to the FTL objective to prevent overfitting:

$$ w = \mathop{\arg\min}_{w} \sum_{i=1}^{t} f_i\left(w\right) + R(w) $$

The FTRL loss function is generally not something that can be minimized quickly either, so in practice one optimizes a surrogate loss instead. The surrogate must satisfy two requirements:

1. It is easy to optimize, ideally with a closed-form solution.
2. The solution obtained from the surrogate is not far from the solution of the original objective.

At step \(t\) the weights are chosen by minimizing the surrogate built from the first \(t-1\) samples:

$$ w_{t} = \mathop{\arg\min}_{w} h_{t-1}\left(w\right) $$

and the quality of the surrogate is measured by the regret against the best fixed \(w^{*}\):

$$ Regret_{T} = \sum_{t=1}^{T} f_{t}\left(w_{t}\right) - \sum_{t=1}^{T} f_{t}\left(w^{*}\right) $$

A surrogate is acceptable when the average regret vanishes:

$$ \lim_{T \rightarrow \infty} \frac{Regret_T}{T} = 0 $$

FTRL-Proximal uses the surrogate

$$ h_{t} = \sum_{i=1}^{t} g_{i} \cdot w + \sum_{i=1}^{t} \left(\frac{1}{2\eta_{i}} - \frac{1}{2\eta_{i-1}}\right) \|w - w_{i}\|^{2} $$

where \(g_i\) is the gradient of \(f_i\) and the learning rate decays as

$$ \eta_{t} = \frac{\alpha}{\sqrt{\sum_{i=1}^{t} g_{i}^{2}}} $$

Adding an L1 term to the surrogate produces sparse solutions:

$$ h_{t} = \sum_{i=1}^{t} g_{i} \cdot w + \sum_{i=1}^{t} \left(\frac{1}{2\eta_{i}} - \frac{1}{2\eta_{i-1}}\right) \|w - w_{i}\|^{2} + \lambda_{1} \|w\|_{1} $$

As long as \(f_t(w)\) is convex, the surrogate above is guaranteed to satisfy:

$$ \lim_{T \rightarrow \infty} \frac{Regret_T}{T} = 0 $$

This objective has a per-coordinate closed-form minimizer:

$$
w_{t+1,i} = \left\{
\begin{array}{ll}
0 & |z_{t,i}| < \lambda_{1} \\
-\eta_{t,i}\left(z_{t,i} - \operatorname{sgn}(z_{t,i})\,\lambda_{1}\right) & \text{otherwise}
\end{array}
\right.
$$

where

$$
z_{t,i} = \sum_{s=1}^{t} g_{s,i} - \sum_{s=1}^{t} \left(\frac{1}{\eta_{s,i}} - \frac{1}{\eta_{s-1,i}}\right) w_{s,i}
$$

The training loop maintains \(z_i\) and \(n_i\) for each dimension. For \(t = 1 \ldots T\), for each coordinate \(i\):

$$ g_{t,i} = \frac{\partial f_{t}\left(w_{t}\right)}{\partial w_{t,i}} $$

$$ z_{i} \,{+}{=}\, g_{t,i} - \frac{1}{\alpha}\left(\sqrt{n_{i} + g_{t,i}^{2}} - \sqrt{n_{i}}\right) w_{t,i} $$

$$ n_{i} \,{+}{=}\, g_{t,i}^{2} $$

then update

$$
w_{t+1,i} = \left\{
\begin{array}{ll}
0 & |z_{i}| < \lambda_{1} \\
-\eta_{t,i}\left(z_{i} - \operatorname{sgn}(z_{i})\,\lambda_{1}\right) & \text{otherwise}
\end{array}
\right.
$$
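The loop above can be sketched as a small class, with \(\eta_{t,i} = \alpha/\sqrt{n_i}\) as defined earlier; the hyperparameters and gradients below are illustrative:

```python
import math

class FTRLProximal:
    """Per-coordinate FTRL-Proximal with L1 regularization."""

    def __init__(self, dim, alpha=0.1, lambda1=1.0):
        self.alpha, self.lambda1 = alpha, lambda1
        self.z = [0.0] * dim   # accumulated (shifted) gradients
        self.n = [0.0] * dim   # accumulated squared gradients

    def weights(self):
        w = []
        for z_i, n_i in zip(self.z, self.n):
            if abs(z_i) <= self.lambda1:
                w.append(0.0)  # L1 drives small coordinates to exactly zero
            else:
                eta = self.alpha / math.sqrt(n_i)  # n_i > 0 whenever z_i != 0
                w.append(-eta * (z_i - math.copysign(self.lambda1, z_i)))
        return w

    def update(self, g):
        """Fold in one gradient vector g."""
        w = self.weights()
        for i, g_i in enumerate(g):
            sigma = (math.sqrt(self.n[i] + g_i * g_i) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g_i - sigma * w[i]
            self.n[i] += g_i * g_i

model = FTRLProximal(dim=2)
for _ in range(3):
    model.update([0.5, 0.01])
w = model.weights()   # w[1] stays exactly 0 (sparsity); w[0] moves against its gradient
```

The exact zeros produced by `weights()` are what keeps the trained model small: coordinates whose accumulated evidence \(|z_i|\) never exceeds \(\lambda_1\) never enter the model.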

## Online Learning in Practice

We compared the two models offline:

| Model | Offline metric | Model size |
| --- | --- | --- |
| FTRL | 0.8432 | 2,000,000 |
| BPR | 0.8441 | 15,000,000 |

BPR scores slightly better, but we chose FTRL for the online system, mainly because FTRL produces sparse weights and therefore a much smaller trained model.

## References

- [1] McMahan H B, Holt G, Sculley D, et al. Ad Click Prediction: a View from the Trenches. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 2013.
- [2] Graepel T, Candela J Q, Borchert T, et al. Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine. Proceedings of the 27th International Conference on Machine Learning (ICML). 2010.
- [3] Murphy K P. Machine Learning: A Probabilistic Perspective. The MIT Press. 2012.