Spark / SPARK-5133

Feature Importance for Random Forests


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5.0
    • Component/s: ML, MLlib
    • Labels: None

    Description

      Add feature importance to random forest models.
      If people are interested in this feature, I could implement it with the help of a mentor (for API decisions, etc.). A description of the feature follows:

      Decision trees intrinsically perform feature selection by selecting appropriate split points. This information can be used to assess the relative importance of a feature.
      Relative feature importance gives valuable insight into a decision tree or tree ensemble and can even be used for feature selection.

      More information on feature importance (via decrease in impurity) can be found in The Elements of Statistical Learning, 2nd ed. (Section 10.13.1) or in the scikit-learn documentation [1].
      R's randomForest package uses a different technique for assessing variable importance that is based on permutation tests.
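
For comparison, the permutation idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the randomForest implementation: `predict`, `accuracy`, and `permutation_importance` are hypothetical names, the model is any callable mapping a feature row to a label, and a plain evaluation set stands in for the out-of-bag samples that randomForest uses.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which the model predicts the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when one feature column is shuffled.

    Assumption: X is a list of equal-length rows and y the matching labels.
    """
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle only column j, leaving all other columns intact.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [list(row[:j]) + [v] + list(row[j + 1:])
                      for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances
```

A feature the model never consults gets an importance of exactly zero, since shuffling it cannot change any prediction.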

      All information needed to compute relative importance scores should already be available in the tree representation (class Node: the split, the impurity gain, and possibly the (weighted) number of samples).
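
A minimal sketch of the impurity-based computation, assuming each internal node exposes its split feature, its impurity gain, and its (weighted) sample count. The `Node` dataclass below is a hypothetical stand-in written for this example, not MLlib's Node class, and the normalization (per-tree scores scaled to sum to 1, then averaged over the ensemble) is one common convention rather than a settled API decision.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    # Hypothetical tree node used only for this sketch.
    n_samples: float                 # (weighted) number of samples reaching the node
    feature: Optional[int] = None    # index of the split feature; None for a leaf
    gain: float = 0.0                # impurity decrease achieved by the split
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def tree_importances(root: Node, n_features: int) -> List[float]:
    """Sum n_samples * gain per split feature, then normalize to sum to 1."""
    totals = [0.0] * n_features
    stack = [root]
    while stack:
        node = stack.pop()
        if node.feature is None:
            continue  # leaves contribute nothing
        totals[node.feature] += node.n_samples * node.gain
        stack.extend(c for c in (node.left, node.right) if c is not None)
    norm = sum(totals)
    return [t / norm for t in totals] if norm > 0 else totals


def forest_importances(roots: List[Node], n_features: int) -> List[float]:
    """Average the normalized per-tree importances and renormalize."""
    per_tree = [tree_importances(r, n_features) for r in roots]
    avg = [sum(col) / len(per_tree) for col in zip(*per_tree)]
    norm = sum(avg)
    return [a / norm for a in avg] if norm > 0 else avg
```

Weighting each gain by the number of samples reaching the node means that splits near the root, which affect more of the data, count for more than equally pure splits deep in the tree.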

      [1] http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation


          People

            josephkb Joseph K. Bradley
            pprett Peter Prettenhofer
            Yanbo Liang Yanbo Liang
            Votes: 3
            Watchers: 7


              Time Tracking

                Estimated: 168h
                Remaining: 168h
                Logged: Not Specified
