Description
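In DecisionTree and RandomForest binary classification with ordered categorical features, we order the categories' bins by the hard prediction (the predicted label), but we should order them by the soft prediction (the predicted class probability). Ordering by the hard prediction makes all categories with the same predicted label tie, so the ordering loses the information needed to choose good splits.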
Here are the 2 places in mllib and ml:
- https://github.com/apache/spark/blob/45de518742446ddfbd4816c9d0f8501139f9bc2d/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala#L887
- https://github.com/apache/spark/blob/45de518742446ddfbd4816c9d0f8501139f9bc2d/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L779
The PR that fixes this should include a unit test that isolates the issue, ideally by calling binsToBestSplit directly.
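To illustrate why the ordering matters, here is a minimal standalone sketch in Python. It is not Spark code: the per-category label counts, the function names, and the tie-breaking rule are all hypothetical and chosen only for illustration. It assumes that splits on an ordered categorical feature are taken as prefixes of the category ordering.

```python
# Hypothetical per-category label counts: (count of label 0, count of label 1).
counts = {
    "b": (6, 4),
    "a": (9, 1),
    "d": (1, 9),
    "c": (4, 6),
}


def soft_prediction(n0, n1):
    """Fraction of label-1 examples in the category."""
    return n1 / (n0 + n1)


def hard_prediction(n0, n1):
    """Majority label as 0.0 or 1.0 (assumption: ties go to label 0)."""
    return 1.0 if n1 > n0 else 0.0


def ordered_categories(centroid):
    # sorted() is stable, so categories with equal centroids keep input order.
    return sorted(counts, key=lambda cat: centroid(*counts[cat]))


def candidate_splits(order):
    # Each candidate split sends a prefix of the ordering to the left child.
    return [set(order[:i]) for i in range(1, len(order))]


hard_order = ordered_categories(hard_prediction)
soft_order = ordered_categories(soft_prediction)

print("hard:", hard_order, candidate_splits(hard_order))
print("soft:", soft_order, candidate_splits(soft_order))
```

With these counts, the hard prediction ties "a" with "b" and "c" with "d", so the order within each pair is arbitrary and the split that isolates "a" (the purest category) is never considered. Ordering by the soft prediction sorts the categories by their label-1 fraction and makes that split a candidate.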