  1. Spark
  2. SPARK-15699

Add chi-squared test statistic as a split quality metric for decision trees

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: ML, MLlib

    Description

      Test statistics can be used to measure decision tree split quality, and they also provide a useful criterion for halting further splits, which can improve model quality. I propose adding the chi-squared test statistic as a new impurity option (in addition to "gini" and "entropy") for classification decision trees and ensembles.

      I wrote a blog post that explains some useful properties of test statistics for measuring split quality, with some example results:
      http://erikerlandson.github.io/blog/2016/05/26/measuring-decision-tree-split-quality-with-test-statistic-p-values/

      (Other test statistics are also possible, for example Welch's t-test for regression trees, but they could be addressed separately.)
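
As a rough illustration of the proposed metric, the following Python sketch computes the Pearson chi-squared statistic for a candidate split from the per-class label counts in its two children, and derives a p-value that could serve as a halting test. This is not Spark code and not necessarily how the proposal would be implemented: the function names `chi_squared_statistic` and `p_value_binary` are hypothetical, and the p-value helper assumes a binary classification problem (one degree of freedom).

```python
import math
from typing import Sequence


def chi_squared_statistic(left: Sequence[float], right: Sequence[float]) -> float:
    """Pearson chi-squared statistic for the 2 x K table of class counts
    in the left and right children of a candidate split (hypothetical helper)."""
    if len(left) != len(right):
        raise ValueError("both children must report counts for the same classes")
    n_left, n_right = sum(left), sum(right)
    total = n_left + n_right
    # A split that leaves one child empty carries no evidence of separation.
    if n_left == 0 or n_right == 0:
        return 0.0
    stat = 0.0
    for l_count, r_count in zip(left, right):
        class_total = l_count + r_count
        if class_total == 0:
            continue  # class absent from this node; it contributes nothing
        for observed, child_total in ((l_count, n_left), (r_count, n_right)):
            # Expected count if class labels were independent of the split.
            expected = child_total * class_total / total
            stat += (observed - expected) ** 2 / expected
    return stat


def p_value_binary(stat: float) -> float:
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom,
    which applies to a two-class, two-child split (assumption of this sketch)."""
    return math.erfc(math.sqrt(stat / 2.0))


# A split that separates the classes well versus one that does not.
good = chi_squared_statistic([8, 2], [2, 8])
poor = chi_squared_statistic([5, 5], [5, 5])
```

A larger statistic (smaller p-value) indicates a split whose children differ more in class distribution than chance would explain. A tree builder could, for example, decline to split a node when the best candidate's p-value exceeds a chosen significance level.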

            People

              Assignee: Unassigned
              Reporter: Erik Erlandson (eje)
              Votes: 0
              Watchers: 3
