Description
Improvement: test-time computation
Currently, pairs of leaf nodes with the same parent can both output the same prediction. This happens because the splitting criterion (e.g., Gini impurity) is not the same as prediction accuracy or MSE: a split can improve the criterion even when both children would still output the same prediction (e.g., the same majority label for classification).
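A small worked example may make this concrete. The label counts below are made up for illustration and the helper names (`gini`, `majority`) are hypothetical, not Spark APIs; the sketch only shows that a split can lower weighted Gini impurity while both children keep the same majority label.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def majority(labels):
    """Most frequent label, i.e. what a leaf would predict."""
    return Counter(labels).most_common(1)[0][0]


# Illustrative data: 8 positives and 2 negatives at the parent node.
parent = [1] * 8 + [0] * 2
left = [1] * 5              # pure child
right = [1] * 3 + [0] * 2   # mixed child, still majority-positive

# Weighted impurity of the children, then the impurity decrease of the split.
weighted_children = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
gain = gini(parent) - weighted_children

print(f"impurity decrease: {gain:.2f}")
print("left predicts:", majority(left), "| right predicts:", majority(right))
```

The split has a positive impurity decrease, so a Gini-based learner may accept it, yet both leaves predict class 1 and the split does not change any prediction.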
After training, we could check the tree and collapse such redundant splits where possible.
Note: This happens with scikit-learn as well.
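The post-training check described above could be sketched roughly as follows. This is a minimal illustration on a toy binary tree, not Spark's implementation: the `Node` class, its fields, and the `prune` function are all hypothetical, and it assumes every internal node has exactly two children.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # Hypothetical toy node: a real tree node would also carry split details,
    # impurity statistics, and so on.
    prediction: float
    split: Optional[str] = None      # description of the split; None for leaves
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def prune(node: Node) -> Node:
    """Return a copy of the tree in which sibling leaves that output the
    same prediction are merged into a single leaf."""
    if node.is_leaf:
        return Node(prediction=node.prediction)
    # Prune bottom-up so that merges can cascade towards the root.
    left = prune(node.left)
    right = prune(node.right)
    if left.is_leaf and right.is_leaf and left.prediction == right.prediction:
        return Node(prediction=left.prediction)
    return Node(node.prediction, node.split, left, right)


# Usage: the left subtree splits into two leaves that both predict 1.0.
tree = Node(0.0, "x0 <= 3",
            left=Node(1.0, "x1 <= 5", left=Node(1.0), right=Node(1.0)),
            right=Node(0.0))
pruned = prune(tree)
print(pruned.left.is_leaf, pruned.left.prediction)
```

Because the merge is applied bottom-up, a subtree whose leaves all agree collapses to a single leaf. Predictions are unchanged, while the tree gets smaller and cheaper to evaluate at test time.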
Attachments
Issue Links
- causes: SPARK-34591 Pyspark undertakes pruning of decision trees and random forests outside the control of the user, leading to undesirable and unexpected outcomes that are challenging to diagnose and impossible to correct (In Progress)
- is contained by: SPARK-14045 DecisionTree improvement umbrella (Resolved)
- is duplicated by: SPARK-23409 RandomForest/DecisionTree (syntactic) pruning of redundant subtrees (Resolved)
- links to