Wednesday, March 9, 2016

Variable Importance in Random Forests

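In data analysis and data mining, both identifying the drivers of an event and selecting features come down to the question of variable importance. The tree-based models in scikit-learn all provide a measure of variable importance, for example: http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html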
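As a minimal sketch in the spirit of that example, the snippet below fits a random forest and reads its `feature_importances_` attribute. The iris dataset, `n_estimators=100`, and `random_state=0` are illustrative assumptions of mine, not values taken from the linked page.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Illustrative dataset (an assumption; any tabular classification data works)
data = load_iris()
X, y = data.data, data.target

# Fit a forest; the hyperparameters here are arbitrary choices for the sketch
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# One non-negative score per feature
importances = forest.feature_importances_

# List features from most to least important
for idx in np.argsort(importances)[::-1]:
    print(f"{data.feature_names[idx]}: {importances[idx]:.3f}")
```

The scores are relative: they say how the forest ranked the features on this data, not how strongly each feature drives the target in general.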

But how exactly is this variable importance computed?


http://scikit-learn.org/stable/modules/tree.html describes how Gini impurity and information entropy are computed.
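To make the two impurity measures concrete, here is a small pure-Python sketch. The helper names (`class_proportions`, `gini`, `entropy`) are hypothetical, and the base-2 logarithm for entropy is my assumption.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Fraction of samples belonging to each class present at a node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2 (equivalently sum_k p_k * (1 - p_k))."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    """Entropy: -sum_k p_k * log2(p_k), over classes present at the node."""
    return sum(-p * math.log2(p) for p in class_proportions(labels))

balanced = ["a", "a", "b", "b"]  # two classes, evenly mixed
pure = ["a", "a", "a", "a"]      # a single class

print(gini(balanced), entropy(balanced))
print(gini(pure), entropy(pure))
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, which is why a good split is one that lowers them.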

The answers at http://stackoverflow.com/questions/15810339/how-are-feature-importances-in-randomforestclassifier-determined discuss how these importances are determined.

Following the descriptions above led me to this paper: http://papers.nips.cc/paper/4928-understanding-variable-importances-in-forests-of-randomized-trees.pdf
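As far as I understand the paper, the Mean Decrease Impurity (MDI) importance of a variable X_m sums the impurity decrease of every node that splits on X_m, weights each node by the fraction of samples reaching it, and averages over the N_T trees of the forest. The notation below is reproduced from my reading and should be checked against the paper itself:

```latex
\mathrm{Imp}(X_m) = \frac{1}{N_T} \sum_{T} \sum_{t \in T :\, v(s_t) = X_m} p(t)\, \Delta i(s_t, t),
\qquad
\Delta i(s, t) = i(t) - p_L\, i(t_L) - p_R\, i(t_R)
```

Here i(t) is the impurity of node t (Gini or entropy, for instance), p(t) = N_t / N is the proportion of samples reaching t, v(s_t) is the variable used by the split s_t, and p_L, p_R are the proportions of the samples at t sent to the left and right children t_L and t_R.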



A worked example is given here: http://stats.stackexchange.com/questions/92419/relative-importance-of-a-set-of-predictors-in-a-random-forests-classification-in
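The same kind of hand calculation can be sketched in code for a single scikit-learn decision tree. This is a sketch, not scikit-learn's own implementation: it walks the fitted `tree_` arrays (`children_left`, `children_right`, `feature`, `impurity`, `weighted_n_node_samples`), accumulates the weighted impurity decrease per feature, and compares the result with `feature_importances_`. The iris dataset and the final normalization to a sum of 1 are my assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
t = clf.tree_

manual = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf node: no split, so no impurity decrease
        continue
    # Impurity decrease of this split, weighted by the samples in each node
    decrease = (
        t.weighted_n_node_samples[node] * t.impurity[node]
        - t.weighted_n_node_samples[left] * t.impurity[left]
        - t.weighted_n_node_samples[right] * t.impurity[right]
    )
    manual[t.feature[node]] += decrease

manual /= t.weighted_n_node_samples[0]  # turn sample counts into proportions
manual /= manual.sum()                  # normalize so importances sum to 1

print(manual)
print(clf.feature_importances_)
```

If the two printed vectors agree, the attribute is just this weighted impurity decrease. For a forest, my understanding is that the per-tree values are then averaged across trees.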




Now it finally makes sense.
Here is one more example:
http://blog.datadive.net/selecting-good-features-part-iii-random-forests/
