[SPARK-5133] Feature Importance for Random Forests - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.5.0
Component/s: ML, MLlib
Labels: None
Target Version/s: 1.5.0

Description

Add feature importance to random forest models.
If there is interest in this feature, I could implement it with guidance from a mentor (on API decisions, etc.). A description of the feature follows:

Decision trees intrinsically perform feature selection by selecting appropriate split points. This information can be used to assess the relative importance of a feature.
Relative feature importance gives valuable insight into a decision tree or tree ensemble and can even be used for feature selection.

More information on feature importance (via decrease in impurity) can be found in ESL II (Section 10.13.1) or in the scikit-learn documentation [1].
R's randomForest package uses a different technique for assessing variable importance that is based on permutation tests.
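
To illustrate the permutation-based idea, here is a minimal sketch in plain Python. It is a simplified version of the general technique (shuffle one feature column and measure the drop in accuracy), not the exact procedure used by R's randomForest package. The names `accuracy`, `permutation_importance`, and the toy `predict` function are hypothetical and exist only for this example.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which predict(row) matches the label."""
    correct = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return correct / len(rows)


def permutation_importance(predict, rows, labels, n_features, seed=0):
    """Return, per feature, the drop in accuracy after shuffling that feature."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for j in range(n_features):
        # Shuffle column j only, leaving all other columns untouched.
        column = [row[j] for row in rows]
        rng.shuffle(column)
        permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return drops


# Toy model (an assumption for illustration): only feature 0 matters.
def predict(row):
    return int(row[0] > 0.5)


data_rng = random.Random(42)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [predict(row) for row in rows]

drops = permutation_importance(predict, rows, labels, n_features=2)
print(drops)
```

Because the toy model ignores feature 1, shuffling that column cannot change any prediction, so its score is exactly zero; feature 0 is the only one that can show a drop.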

All the information needed to compute relative importance scores should already be available in the tree representation (class Node: split, impurity gain, and possibly the (weighted) number of samples).
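
As a rough illustration of how these per-node quantities could be combined, the sketch below sums the sample-weighted impurity gain per split feature, normalizes within each tree, and averages across trees. The `Node` dataclass and its field names are hypothetical stand-ins for this example, not Spark's actual classes, and the normalization choices are assumptions rather than a settled API decision.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node; field names are illustrative only."""
    feature: Optional[int] = None   # split feature index, None for a leaf
    gain: float = 0.0               # impurity decrease achieved by the split
    n_samples: float = 0.0          # (weighted) number of samples at the node
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def tree_importances(root: Node, n_features: int) -> List[float]:
    """Sum gain * n_samples per split feature, then normalize to sum to 1."""
    totals = [0.0] * n_features
    stack = [root]
    while stack:
        node = stack.pop()
        if node is None or node.feature is None:
            continue  # leaves contribute nothing
        totals[node.feature] += node.gain * node.n_samples
        stack.append(node.left)
        stack.append(node.right)
    total = sum(totals)
    return [t / total for t in totals] if total > 0 else totals


def forest_importances(trees: List[Node], n_features: int) -> List[float]:
    """Average the normalized per-tree importances, then renormalize."""
    sums = [0.0] * n_features
    for root in trees:
        for i, value in enumerate(tree_importances(root, n_features)):
            sums[i] += value
    total = sum(sums)
    return [s / total for s in sums] if total > 0 else sums


# Small example tree: the root splits on feature 0, its left child on feature 1.
example_tree = Node(
    feature=0, gain=0.4, n_samples=100,
    left=Node(feature=1, gain=0.2, n_samples=50,
              left=Node(n_samples=25), right=Node(n_samples=25)),
    right=Node(n_samples=50),
)
importances = tree_importances(example_tree, n_features=3)
print(importances)
```

Feature 2 is never used for a split in the example tree, so it receives an importance of zero, while the root split on feature 0 dominates because it affects the most samples.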

[1] http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation

Issue Links

is blocked by: SPARK-6885 Decision trees: predict class probabilities (Resolved)
is related to: SPARK-7674 R-like stats for ML models (Resolved)
relates to: SPARK-9904 User guide for ML tree algorithms (Closed)
links to: [Github] Pull Request #7838 (jkbradley)

Sub-Tasks

Expose featureImportances on org.apache.spark.mllib.tree.RandomForest (Status: Resolved, Assignee: Unassigned)

People

Assignee: Joseph K. Bradley
Reporter: Peter Prettenhofer
Shepherd: Yanbo Liang
Votes: 3
Watchers: 7

Dates

Created: 07/Jan/15 13:30
Updated: 12/Aug/15 21:28
Resolved: 03/Aug/15 19:18

Time Tracking

Estimated: 168h
Remaining: 168h
Logged: Not Specified