Machine Learning第十周笔记：大规模机器学习

刚刚完成了Andrew Ng在Cousera上的Machine Learning的第十周课程，这周主要介绍的是大规模机器学习，现将笔记整理在下面。Gradient Descent with Large DatasetsLearning With Large Datasets在前面介绍bias-variance的时候，我们曾提到一个比较各种算法孰优孰劣的实验，结论是”it’s not who has

Marcovaldo

8214人浏览 · 2016-04-14 12:32:30

Marcovaldo · 2016-04-14 12:32:30 发布

博客已经迁移到Marcovaldo’s blog (http://marcovaldong.github.io/)

刚刚完成了Andrew Ng在Cousera上的Machine Learning的第十周课程，这周主要介绍的是大规模机器学习，现将笔记整理在下面。

Gradient Descent with Large Datasets

Learning With Large Datasets

在前面介绍bias-variance的时候，我们曾提到一个比较各种算法孰优孰劣的实验，结论是”it’s not who has the best algorithm that wins, it’s who has the most data.”但处理大规模的数据库时往往面临着计算问题。假设有一个数据量 m=100,000,000 <script id="MathJax-Element-1" type="math/tex">m=100,000,000</script>的数据库，那么在做Gradient descent时，对于每一个 θj <script id="MathJax-Element-2" type="math/tex">\theta_j</script>都有

θ j : = θ j - α 1 m \sum i = 1 m (h θ (x (i)) - y (i)) x (i) j

m=100,000,000 <script id="MathJax-Element-4" type="math/tex">m=100,000,000</script>次，如果模型中有

n=1000 <script id="MathJax-Element-5" type="math/tex">n=1000</script>个特征，进行一次迭代的计算量将是

O(1011) <script id="MathJax-Element-6" type="math/tex">O(10^11)</script>，整个算法的训练将耗费大量的时间。现在我们使用training set的一个很小的子集去训练模型，看看这样是否会收到好的效果。下图中给出了

Jtrain(θ) <script id="MathJax-Element-7" type="math/tex">J_{train}(\theta)</script>和

Jcv(θ) <script id="MathJax-Element-8" type="math/tex">J_{cv}(\theta)</script>随m取值变化的图像，如果在小的子集上训练得来的

Jcv(θ) <script id="MathJax-Element-9" type="math/tex">J_{cv}(\theta)</script>比

Jtrain(θ) <script id="MathJax-Element-10" type="math/tex">J_{train}(\theta)</script>还大得多，那增大training set会继续改善模型；如果在小的子集上得来的

Jcv(θ) <script id="MathJax-Element-11" type="math/tex">J_{cv}(\theta)</script>和

Jtrain(θ) <script id="MathJax-Element-12" type="math/tex">J_{train}(\theta)</script>已经基本相当，那我们就没有必要再去增大training set了。（具体原因可查看第六周的bias-variance）

10_1

Stochastic Gradient Descent

以线性回归的梯度下降过程为例来介绍随机梯度下降（stochastic gradient descent）。下图给出了线性回归的hypothesis、损失函数和梯度下降过程：

10_2

现在面临的问题是，m值得特别大导致 θj <script id="MathJax-Element-13" type="math/tex">\theta_j</script>的每一次更新需要花费高昂的计算开销和内存开销，而且往往需要递归很多次才能得到一个好的hypothesis。例如， m=300,000,000 <script id="MathJax-Element-14" type="math/tex">m=300,000,000</script>（美国人口的数量），计算一次 θj <script id="MathJax-Element-15" type="math/tex">\theta_j</script>就需要300,000,000次的加减运算。我们将这样的原始梯度下降法称为批量梯度下降（batch gradient descent）。

在随机梯度下降中，我们定义模型在数据点( x(i) <script id="MathJax-Element-16" type="math/tex">x^{(i)}</script>, y(i) <script id="MathJax-Element-17" type="math/tex">y^{(i)}</script>)的损失为

c o s t (θ, (x (i), y (i))) = 1 2 (h θ (x (i)) - y (i)) 2

J t r a i n (θ) = 1 m \sum i = 1 m c o s t (θ, (x (i), y (i)))

θ <script id="MathJax-Element-20" type="math/tex">\theta</script>。注意，此处与批量梯度下降的不同在于，批量梯度下降的参数

θ <script id="MathJax-Element-21" type="math/tex">\theta</script>的每次改进都是在所有的数据点上，而随机梯度下降的参数

θ <script id="MathJax-Element-22" type="math/tex">\theta</script>的每次改进则是在一个数据点上的。差别在于，批量梯度下降的每次改进都是基于整个training set做的，因此每次改进都是奔着全局最优去的，但计算开销太大；而随机梯度下降的每次改进是基于某个数据点做的，计算开销小，不能保证每次改进都是奔着全局最优去的，但总体方向仍然是全局最优。下图右侧图中有两种方式的迭代路线示意，批量梯度下降是红色所示（大半被遮住了），直奔全局最优；而随机梯度下降是粉色所示，可能要经过很多次参数改进（没关系，每次更新代价小，其实更快）才能达到最优。总体来讲，随机梯度下降比较快，且精度不比批量梯度下降差。

10_3

Mini-Batch Gradient Descent

前面我们提到，批量梯度下降对参数 θ <script id="MathJax-Element-23" type="math/tex">\theta</script>的每次改进都是基于整个training set做的，而随机梯度下降对参数的改进都是基于某一个数据点做的。那么，我们为什么不可以做一个折中呢？这就引出了mini-batch gradient descent，其对 θ <script id="MathJax-Element-24" type="math/tex">\theta</script>的改进是在b（b=2~100）个数据点上做的，这样既保证每次改进基本上都是向着全局最优的方向，又有了比较快的速度。下图给出了在一个m=1000的training set上做mini-btach gradient descent的过程：

10_4

Stochastic Gradient Descent Convergence

折一小节我们来介绍判断随机梯度下降是否收敛的技巧。前面我们说用 cost(θ,(x(i),y(i))) <script id="MathJax-Element-25" type="math/tex">cost(\theta,(x^{(i)},y^{(i)}))</script>来表示hypothesis在数据点 (x(i),y(i)) <script id="MathJax-Element-26" type="math/tex">(x^{(i)},y^{(i)})</script>处的损失，现在我们计算随机梯度下降每改进 θ <script id="MathJax-Element-27" type="math/tex">\theta</script>1000次（用到了1000个数据点），hypothesis在这1000个数据点上平均损失，然后观察其变化情况。下图给出了这个平均损失随迭代次数的变化情况。左上的图像中蓝色线和红色线分别表示一个大的learning rate和一个小的learning rate对这个 θ <script id="MathJax-Element-28" type="math/tex">\theta</script>每改进1000次的平均损失变化的影响。右上的图像中蓝色线和红色线分别表示 θ <script id="MathJax-Element-29" type="math/tex">\theta</script>每改进1000次计算一次平均损失和 θ <script id="MathJax-Element-30" type="math/tex">\theta</script>每改进5000次计算一次平均损失的区别（5000的对应的图像更平滑）。左下图像的蓝色线表示 θ <script id="MathJax-Element-31" type="math/tex">\theta</script>每改进1000次计算一次的平均损失随迭代次数的变化情况，此时我们发现算法好像不收敛，改成 θ <script id="MathJax-Element-32" type="math/tex">\theta</script>每改进5000次计算一次平均损失，再绘制图像（图中红色线）可能就会发现，算法在慢慢收敛。若发现更改后图像如图中粉色线所示，那说明这个算法应该不适用，我们需要考虑其他的算法了。如果图像如右下所示，呢就应该换一个小的learning rate。