# Random Forest and mRMR Feature Selection

## Random Forest

The Gini index is biased toward multi-valued attributes, struggles when the number of classes is large, and tends to favor partitions of equal size and purity. In practice, however, it works well.

$$IG(D, A) = H(D) - H(D|A)$$
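The information gain above can be computed directly from the definition; the snippet below is a minimal sketch (the helper names `entropy` and `information_gain` are our own, not from any library):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy H(D) of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature):
    """IG(D, A) = H(D) - H(D|A) for a discrete feature A."""
    h_d = entropy(labels)
    h_cond = 0.0
    for v in np.unique(feature):
        mask = feature == v
        # weight each branch's entropy by its proportion of samples
        h_cond += mask.mean() * entropy(labels[mask])
    return h_d - h_cond

# Toy example: a feature that splits the labels perfectly recovers all of H(D)
y = np.array([0, 0, 1, 1])
a = np.array(["x", "x", "y", "y"])
print(information_gain(y, a))  # 1.0
```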

• First build $N_{tree}$ decision trees and compute each tree's out-of-bag (OOB) error $errOOB_1$.
• To score a feature $x_i$, randomly permute the values of $x_i$ in the OOB samples and recompute the OOB error $errOOB_2$; the importance of $x_i$ is $\sum \frac{errOOB_2 - errOOB_1}{N_{tree}}$.
• Rank the features by importance and discard the least important ones.
• Repeat the steps above until m features remain.
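The permutation step above can be sketched with scikit-learn's `permutation_importance`. Note this is an approximation: it permutes features on a held-out split rather than on each tree's own OOB samples, and the synthetic dataset here is illustrative only:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only 3 of the 10 features carry signal
X, y = make_regression(n_samples=500, n_features=10, n_informative=3,
                       random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Permute each feature and measure the drop in score, analogous to the
# increase in OOB error described in the steps above
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print("Features ranked by permutation importance:", ranking)
```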

```python
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
import numpy as np

# Load the Boston housing dataset as an example
# (note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2)
boston = load_boston()
X = boston["data"]
Y = boston["target"]
names = boston["feature_names"]

rf = RandomForestRegressor()
rf.fit(X, Y)
print("Features sorted by their score:")
print(sorted(zip(map(lambda x: round(x, 4), rf.feature_importances_), names),
             reverse=True))
```

```
Features sorted by their score:
[(0.5298, 'LSTAT'), (0.4116, 'RM'), (0.0252, 'DIS'), (0.0172, 'CRIM'), (0.0065, 'NOX'), (0.0035, 'PTRATIO'), (0.0021, 'TAX'), (0.0017, 'AGE'), (0.0012, 'B'), (0.0008, 'INDUS'), (0.0004, 'RAD'), (0.0001, 'CHAS'), (0.0, 'ZN')]
```

## mRMR

No. 1: Maximal relevance — maximize D, the mean mutual information between the selected features and the class.

No. 2: Minimal redundancy — minimize R, the mean mutual information among the selected features themselves.

$$\max \Phi(D, R), \quad \Phi = D - R$$
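A minimal greedy sketch of the criterion above: at each step, add the feature with the largest relevance-minus-redundancy score. The helper `mrmr` is our own, and the crude equal-width binning used to estimate pairwise mutual information is an assumption, not part of the original mRMR formulation:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

def mrmr(X, y, k):
    """Greedy mRMR: repeatedly add the feature maximizing D - R."""
    n_features = X.shape[1]
    relevance = mutual_info_classif(X, y, random_state=0)  # I(x_i; c)
    selected = [int(np.argmax(relevance))]  # start from the most relevant
    while len(selected) < k:
        best, best_score = None, -np.inf
        for i in range(n_features):
            if i in selected:
                continue
            # R: mean mutual information with the already-selected features,
            # estimated by discretizing continuous columns into 10 bins
            redundancy = np.mean([
                mutual_info_score(
                    np.digitize(X[:, i], np.histogram_bin_edges(X[:, i], 10)),
                    np.digitize(X[:, j], np.histogram_bin_edges(X[:, j], 10)))
                for j in selected])
            score = relevance[i] - redundancy  # Phi = D - R for this candidate
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected

X, y = load_iris(return_X_y=True)
print("Selected feature indices:", mrmr(X, y, 2))
```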