Description
We should use weighted split points rather than the raw continuous binned feature values. For instance, in a dataset containing binary features (fed in as continuous ones), our splits are currently selected as x <= 0.0 and x > 0.0. For any real data with some smoothness properties, this is asymptotically bad compared to GBM's approach. The split point should instead be a weighted combination of the two adjacent ("innermost") bin values; e.g., if there are 30 samples with x = 0 and 10 with x = 1, the split above should be at 0.75.
Example:
+--------+--------+-----+-----+
|feature0|feature1|label|count|
+--------+--------+-----+-----+
|     0.0|     0.0|  0.0|   23|
|     1.0|     0.0|  0.0|    2|
|     0.0|     0.0|  1.0|    2|
|     0.0|     1.0|  0.0|    7|
|     1.0|     0.0|  1.0|   23|
|     0.0|     1.0|  1.0|   18|
|     1.0|     1.0|  1.0|    7|
|     1.0|     1.0|  0.0|   18|
+--------+--------+-----+-----+

DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes
  If (feature 0 <= 0.0)
   If (feature 1 <= 0.0)
    Predict: -0.56
   Else (feature 1 > 0.0)
    Predict: 0.29333333333333333
  Else (feature 0 > 0.0)
   If (feature 1 <= 0.0)
    Predict: 0.56
   Else (feature 1 > 0.0)
    Predict: -0.29333333333333333
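
Below is a minimal Scala sketch of one weighting that is consistent with the 0.75 example above (the ticket gives the example but not an explicit formula, so this is an assumption, not Spark's actual implementation): place the split between two adjacent bin values at the point that divides the interval in proportion to the bin counts, so 30 samples at x = 0 and 10 at x = 1 give 0 + (30 / 40) * (1 - 0) = 0.75. The object and method names are hypothetical.

object WeightedSplits {
  // Hypothetical helper: weighted split point between two adjacent bin
  // values. The split lands at the count-weighted position along the
  // interval, matching the 0.75 example (30 samples at 0.0, 10 at 1.0).
  def weightedSplit(lowValue: Double, lowCount: Long,
                    highValue: Double, highCount: Long): Double = {
    require(lowCount + highCount > 0, "adjacent bins must contain samples")
    val leftFraction = lowCount.toDouble / (lowCount + highCount)
    lowValue + leftFraction * (highValue - lowValue)
  }

  def main(args: Array[String]): Unit = {
    // 30 samples with x = 0.0 and 10 with x = 1.0 => split at 0.75,
    // rather than the current threshold of x <= 0.0.
    println(weightedSplit(0.0, 30L, 1.0, 10L))  // prints 0.75
  }
}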
Issue Links
- Is contained by SPARK-14045 DecisionTree improvement umbrella (Resolved)