Details
- New Feature
- Status: Resolved
- Minor
- Resolution: Incomplete
- 2.0.0
- None
Description
Test statistics can serve as a measure of decision tree split quality, and they provide a useful split-halting criterion that can improve model quality. I propose adding the chi-squared test statistic as a new impurity option (in addition to "gini" and "entropy") for classification decision trees and ensembles.
I wrote a blog post that explains some useful properties of test statistics for measuring split quality, with some example results:
http://erikerlandson.github.io/blog/2016/05/26/measuring-decision-tree-split-quality-with-test-statistic-p-values/
(Other test statistics are also possible, for example Welch's t-test for regression trees, but those could be addressed separately.)
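To make the proposal concrete, here is a minimal sketch of how a chi-squared statistic could score a candidate binary split from per-class label counts in the two child nodes. It is an illustration only, not Spark's implementation. The function names `chi_squared_split_statistic` and `p_value_one_dof` are hypothetical, and the p-value helper assumes exactly one degree of freedom (a two-class problem).

```python
import math


def chi_squared_split_statistic(left_counts, right_counts):
    """Pearson chi-squared statistic for the 2 x K contingency table of
    (child node) x (class label) counts produced by a candidate split.

    Returns (statistic, degrees_of_freedom). Hypothetical helper, not a Spark API.
    """
    if len(left_counts) != len(right_counts):
        raise ValueError("both children need one count per class")

    n_left = sum(left_counts)
    n_right = sum(right_counts)
    n = n_left + n_right
    # A split that leaves one child empty carries no information.
    if n_left == 0 or n_right == 0:
        return 0.0, 0

    stat = 0.0
    nonempty_classes = 0
    for left, right in zip(left_counts, right_counts):
        class_total = left + right
        if class_total == 0:
            continue  # class absent from this node; skip to avoid dividing by zero
        nonempty_classes += 1
        # Expected counts if the split were independent of the class label.
        expected_left = n_left * class_total / n
        expected_right = n_right * class_total / n
        stat += (left - expected_left) ** 2 / expected_left
        stat += (right - expected_right) ** 2 / expected_right

    dof = max(nonempty_classes - 1, 0)
    return stat, dof


def p_value_one_dof(stat):
    """Upper-tail p-value of a chi-squared statistic, valid for 1 degree of freedom only."""
    return math.erfc(math.sqrt(stat / 2.0))


# A perfectly separating split versus an uninformative one (two classes).
pure_stat, pure_dof = chi_squared_split_statistic([10, 0], [0, 10])
flat_stat, flat_dof = chi_squared_split_statistic([5, 5], [5, 5])
print(pure_stat, pure_dof, p_value_one_dof(pure_stat))
print(flat_stat, flat_dof, p_value_one_dof(flat_stat))
```

Under this sketch, a tree builder could halt splitting when the p-value of the best candidate split exceeds a chosen significance threshold. With more than two classes the p-value would need the general chi-squared distribution rather than the one-degree-of-freedom shortcut used here.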
Issue Links
- Is contained by: SPARK-14045 DecisionTree improvement umbrella (Resolved)
- Is related to: SPARK-13868 Random forest accuracy exploration (Resolved)