In data analysis and mining, both probing the drivers of an event and feature selection lead to the question of variable importance. The tree-based models in scikit-learn all provide a measure of variable importance; see, for example: http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
But how exactly is this variable importance computed?
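Before digging into the math, here is a minimal sketch of what the linked scikit-learn example does: fit a forest and read off `feature_importances_`. The synthetic dataset and parameters here are my own illustrative assumptions, not those of the linked example.

```python
# Minimal sketch: train a random forest and inspect feature importances.
# Dataset and hyperparameters are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=6,
                           n_informative=3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# One non-negative score per feature, normalized to sum to 1.
importances = clf.feature_importances_
print(importances)
```

The scores come out normalized, so they can be compared directly across features; the question below is where these numbers come from.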
http://scikit-learn.org/stable/modules/tree.html explains how Gini impurity and information entropy are computed.
http://stackoverflow.com/questions/15810339/how-are-feature-importances-in-randomforestclassifier-determined discusses how these impurity measures turn into importance scores.
Following that discussion led me to this paper: http://papers.nips.cc/paper/4928-understanding-variable-importances-in-forests-of-randomized-trees.pdf
And this question gives a worked example: http://stats.stackexchange.com/questions/92419/relative-importance-of-a-set-of-predictors-in-a-random-forests-classification-in
That finally made it click.
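The two impurity criteria from the scikit-learn tree docs are easy to compute by hand for a node's class proportions. A small sketch (using log base 2 for entropy, a common convention):

```python
# Hand-computed node impurities, matching the formulas in the
# scikit-learn tree documentation.
from math import log2

def gini(proportions):
    """Gini impurity: 1 - sum(p_k ** 2) over class proportions p_k."""
    return 1.0 - sum(p * p for p in proportions)

def entropy(proportions):
    """Information entropy: -sum(p_k * log2(p_k)), skipping p_k = 0."""
    return -sum(p * log2(p) for p in proportions if p > 0)

# A perfectly mixed binary node is maximally impure:
print(gini([0.5, 0.5]))     # 0.5
print(entropy([0.5, 0.5]))  # 1.0
# A pure node has zero impurity:
print(gini([1.0, 0.0]))     # 0.0
```

A split is "good" when it lowers these numbers in the child nodes relative to the parent, which is the quantity the importance calculation accumulates.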
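To check my understanding, here is a sketch of the mean-decrease-impurity idea computed by hand for a single decision tree: each split credits its feature with the impurity decrease it achieves, weighted by the samples reaching that node; the per-feature totals are then normalized. The dataset is an illustrative assumption; the walk over `tree_` uses scikit-learn's exposed tree arrays.

```python
# Recompute a single tree's feature_importances_ by hand from the
# mean-decrease-impurity definition. Dataset is an illustrative assumption.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=4,
                           n_informative=2, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
t = clf.tree_

manual = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:          # leaf: no split, contributes nothing
        continue
    # Weighted impurity decrease of this split, credited to its feature.
    manual[t.feature[node]] += (
        t.weighted_n_node_samples[node] * t.impurity[node]
        - t.weighted_n_node_samples[left] * t.impurity[left]
        - t.weighted_n_node_samples[right] * t.impurity[right]
    )
manual /= manual.sum()      # normalize so the scores sum to 1

print(np.allclose(manual, clf.feature_importances_))
```

For a forest, the same per-tree scores are averaged over all trees, which is exactly the "mean" in mean decrease impurity.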
Here is one more example worth reading:
http://blog.datadive.net/selecting-good-features-part-iii-random-forests/
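The datadive post uses these importances for feature selection by ranking features. A small sketch in that spirit, with the post's dataset swapped for a synthetic regression problem (an illustrative assumption):

```python
# Rank features by random-forest importance, as a feature-selection aid.
# Dataset, feature names, and hyperparameters are illustrative assumptions.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=400, n_features=5,
                       n_informative=2, random_state=0)
names = [f"x{i}" for i in range(X.shape[1])]

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Sort (importance, name) pairs from most to least important.
ranking = sorted(zip(rf.feature_importances_, names), reverse=True)
for score, name in ranking:
    print(f"{name}: {score:.3f}")
```

One caveat the linked discussions also raise: impurity-based importance can be biased toward high-cardinality features and split credit among correlated ones, so the ranking is a guide rather than ground truth.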