In data analysis and mining, both probing the drivers of an event and feature selection lead to the question of variable importance. The tree-based models in scikit-learn all provide a measure of variable importance; see, for example: http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
But how exactly is this variable importance computed?
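Before digging into the math, here is a minimal sketch of what the linked scikit-learn example does: fit a forest and read off `feature_importances_`. The synthetic dataset and parameters here are my own illustrative assumptions, not those of the linked example.

```python
# Minimal sketch: train a random forest and inspect feature importances.
# Dataset and hyperparameters are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=6,
                           n_informative=3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# One non-negative score per feature, normalized to sum to 1.
importances = clf.feature_importances_
print(importances)
```

The scores come out normalized, so they can be compared directly across features; the question below is where these numbers come from.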
http://scikit-learn.org/stable/modules/tree.html explains how Gini impurity and information entropy are computed.
http://stackoverflow.com/questions/15810339/how-are-feature-importances-in-randomforestclassifier-determined discusses how these impurity measures turn into importance scores.
Following that discussion led me to this paper: http://papers.nips.cc/paper/4928-understanding-variable-importances-in-forests-of-randomized-trees.pdf
And this question gives a worked example: http://stats.stackexchange.com/questions/92419/relative-importance-of-a-set-of-predictors-in-a-random-forests-classification-in
That finally made it click.
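The two impurity criteria from the scikit-learn tree docs are easy to compute by hand for a node's class proportions. A small sketch (using log base 2 for entropy, a common convention):

```python
# Hand-computed node impurities, matching the formulas in the
# scikit-learn tree documentation.
from math import log2

def gini(proportions):
    """Gini impurity: 1 - sum(p_k ** 2) over class proportions p_k."""
    return 1.0 - sum(p * p for p in proportions)

def entropy(proportions):
    """Information entropy: -sum(p_k * log2(p_k)), skipping p_k = 0."""
    return -sum(p * log2(p) for p in proportions if p > 0)

# A perfectly mixed binary node is maximally impure:
print(gini([0.5, 0.5]))     # 0.5
print(entropy([0.5, 0.5]))  # 1.0
# A pure node has zero impurity:
print(gini([1.0, 0.0]))     # 0.0
```

A split is "good" when it lowers these numbers in the child nodes relative to the parent, which is the quantity the importance calculation accumulates.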
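To check my understanding, here is a sketch of the mean-decrease-impurity idea computed by hand for a single decision tree: each split credits its feature with the impurity decrease it achieves, weighted by the samples reaching that node; the per-feature totals are then normalized. The dataset is an illustrative assumption; the walk over `tree_` uses scikit-learn's exposed tree arrays.

```python
# Recompute a single tree's feature_importances_ by hand from the
# mean-decrease-impurity definition. Dataset is an illustrative assumption.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=4,
                           n_informative=2, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
t = clf.tree_

manual = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:          # leaf: no split, contributes nothing
        continue
    # Weighted impurity decrease of this split, credited to its feature.
    manual[t.feature[node]] += (
        t.weighted_n_node_samples[node] * t.impurity[node]
        - t.weighted_n_node_samples[left] * t.impurity[left]
        - t.weighted_n_node_samples[right] * t.impurity[right]
    )
manual /= manual.sum()      # normalize so the scores sum to 1

print(np.allclose(manual, clf.feature_importances_))
```

For a forest, the same per-tree scores are averaged over all trees, which is exactly the "mean" in mean decrease impurity.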
Here is one more example worth reading:
http://blog.datadive.net/selecting-good-features-part-iii-random-forests/
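The datadive post uses these importances for feature selection by ranking features. A small sketch in that spirit, with the post's dataset swapped for a synthetic regression problem (an illustrative assumption):

```python
# Rank features by random-forest importance, as a feature-selection aid.
# Dataset, feature names, and hyperparameters are illustrative assumptions.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=400, n_features=5,
                       n_informative=2, random_state=0)
names = [f"x{i}" for i in range(X.shape[1])]

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Sort (importance, name) pairs from most to least important.
ranking = sorted(zip(rf.feature_importances_, names), reverse=True)
for score, name in ranking:
    print(f"{name}: {score:.3f}")
```

One caveat the linked discussions also raise: impurity-based importance can be biased toward high-cardinality features and split credit among correlated ones, so the ranking is a guide rather than ground truth.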