机器学习算法浅析 - Yongsheng Xiao

SGD
支持向量机
线性回归
- 初步印象
- 补充纠正
RNN与LSTM
- 初步印象
协同过滤(Collaborative Filtering)
- 初步印象
参考文献

从6月初开始，就一直想着把各类机器学习方法进行一番总结，但是开始下手的时候又觉得无从下手，感觉动手写的时候，一定是按照各类教材照搬照抄，好像不是我想要的结果，所以一直搁置至今，如今着手开始并非已经形成了自己的见解，实在是因为再开始动手，就错过了这段难得空闲的时间，所以…先试试看吧！

- 2017年6月29日于宁波

SGD

SGD has been successfully applied to large-scale and sparse machine learning problems often encountered in text classification and natural language processing. Given that the data is sparse, the classifiers in this module easily scale to problems with more than 10^5 training examples and more than 10^5 features. The advantages of Stochastic Gradient Descent are:

Efficiency.

Ease of implementation (lots of opportunities for code tuning).

The disadvantages of Stochastic Gradient Descent include:

SGD requires a number of hyperparameters such as the regularization parameter and the number of iterations.

SGD is sensitive to feature scaling.

以上来自sklearn的官方文档¹,SGD Classification是一种线性分类器，上文主要说明一下问题:

适用场景
- SGD适用于大数据量；
- SGD适用于超多的特征维度；
- SGD适用于稀疏数据；
- SGD在文本分类和自然语言处理上有优势；
优势
- 高效
- 易于实现
缺点
- 超参数较多(如正则化参数及迭代次数等)
- 需要做标准化

支持向量机

The advantages of support vector machines are:

Effective in high dimensional spaces.

Still effective in cases where number of dimensions is greater than the number of samples.

Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.

Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.

The disadvantages of support vector machines include:

If the number of features is much greater than the number of samples, the method is likely to give poor performances.

SVMs do not directly provide probability estimates, these are calculated using an expensive five-fold cross-validation (see Scores and probabilities, below).

摘自scikit-learn²:

优点
- 高维空间有效
- 允许维度比样本数多
- 仅由支持向量做决策
- 和函数较多甚至可以自定义
缺点
- 若维度远超样本数，性能较差
- 不能直接提供概率
- 最好是做一下标准化

线性回归

初步印象

机器学习方法最重要的是评估方法，即如何衡量一个模型的好坏优劣；分类问题的衡量标准有准确率， precision / recall / F1，AUC等等，回归问题的衡量标准即损失函数千奇百怪但一般都是可微分的，如均方误差等等。

线性回归是最简单的分类/回归方法，若是采用线性回归去做分类，以 $W^Tx+b$ 作为判断依据，若样本x满足

$W^Tx+b>0$

则该样本判定为正类，否则为反类，选取一组使训练集的分类准确率最高的W和b即可；

线性回归做回归的时候就是选取一组W和b使得各个样本点与

$\hat{y}=W^Tx+b$

之间的距离最小即可，其损失函数可以表示为：

$L(y, \hat{y}) = \frac {1} {2} (y-\hat{y})^2$

如此，必然是参数越多越复杂的模型表现越好，这就会导致过拟合的产生，所以过拟合是必然要解决的大问题， 正则化 就是为此而生的，本人所知的正则化项包括三种：

$L_0$ 正则化项：非0参数的个数；
$L_1$ 正则化项：各参数绝对值之和；
$L_2$ 正则化项：各参数绝对值的平方和的开方；

$L(y, \hat{y}) = \frac {1} {2} (y-\hat{y})^2 + \lambda L(W)$

该损失函数代表着优化目标有两个：误差越小越好，模型越简单越好；

补充纠正

cost function:

$J(\theta) = \frac {1} {2m} \sum_{1}^{m}(h_\theta (x^i) - y^i)^2$

如何选取最优参数？

Gradient Desent

$\theta_j := \theta_j - \alpha \frac {\partial} {\partial{\theta_j}} J(\theta)$

梯度下降求得局部最小值；
梯度下降的学习率
Normal Equation

RNN与LSTM

初步印象

RNN是专为序列数据而生的，RNN与常见的神经网络的区别在于RNN的hidden layer有一个额外的memory cell用于存储该hidden layer的输出结果，而hidden layer的输入包含两个部分：一是上一轮memory cell中保存的内容，二是本轮输入信息，两者共同作用下，经过active function产生输出。

LSTM叫做long short-term memory模型，是比较长的短期记忆模型，相比于RNN，其优势在于可以保留较长时间以前的信息，而RNN做不到这点,这个功能的实现依赖于三个门限函数：input gate, forget gate, output gate; 此三者再加上输入信息, 共同作用下产生输出:输入数据由input gate控制保留多少信息，再经过avtive function，与上一轮cell当中的信息通过 forget gate后求和作为新的memory cell当中的内容，再通过output gate, 输出结果；

协同过滤(Collaborative Filtering)

初步印象

协同过滤是最常用的推荐算法，主要包括两种:

User CF：以用户对所有物品的偏好为向量，计算用户之间的相似程度(欧几里德距离或是相关系数等)，为用户推荐与其最为相似的用户喜欢的物品；
Item CF：以所有用户对物品的偏好为向量，计算物品之间的相似程度，为用户推荐与其喜欢的物品最为相似的物品；

参考文献

Stochastic Gradient Descent — scikit-learn 0.18.2 documentation. (2017). Scikit-learn.org. Retrieved 4 August 2017, from http://scikit-learn.org/stable/modules/sgd.html#classification ↩
Support Vector Machines — scikit-learn 0.18.2 documentation. (2017). Scikit-learn.org. Retrieved 4 August 2017, from http://scikit-learn.org/stable/modules/svm.html#classification ↩

生若直木，不语斧凿.

目录

SGD

支持向量机

线性回归

初步印象

补充纠正

RNN与LSTM

初步印象

协同过滤(Collaborative Filtering)

初步印象

参考文献