Description
Decision trees in spark.ml (RandomForest.scala) communicate twice as much data as needed for unordered categorical features. Here's an example.
Say there are 3 categories A, B, C. We consider 3 splits:
- A vs. B, C
- A, B vs. C
- A, C vs. B
Currently, we collect statistics for each of the 6 subsets of categories (3 splits * 2 sides = 6). However, we could instead collect statistics only for the 3 subsets on the left-hand side of the 3 possible splits: A and A,B and A,C. If we also have stats for the entire node, then we can compute the stats for the 3 subsets on the right-hand side of the splits by subtraction. In pseudomath: stats(B,C) = stats(A,B,C) - stats(A).
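The idea can be sketched in a few lines of plain Python. This is only an illustration, not the spark.ml code: the per-category label counts are made-up numbers, and the names `left_subsets`, `add`, and `sub` are hypothetical helpers introduced here. It assumes the statistics are additive (e.g. per-class label counts), which is what makes the subtraction valid.

```python
from itertools import combinations


def left_subsets(categories):
    """Enumerate the left-hand side of every distinct binary split.

    Pinning the first category to the left side avoids counting each
    split twice (once per orientation), giving 2^(k-1) - 1 subsets.
    """
    first, rest = categories[0], categories[1:]
    subsets = []
    for r in range(len(rest)):  # r < len(rest) keeps the right side non-empty
        for combo in combinations(rest, r):
            subsets.append((first,) + combo)
    return subsets


def add(u, v):
    """Element-wise sum of two stats vectors."""
    return [a + b for a, b in zip(u, v)]


def sub(u, v):
    """Element-wise difference of two stats vectors."""
    return [a - b for a, b in zip(u, v)]


# Hypothetical per-category label counts [class 0, class 1] at one tree node.
per_category = {"A": [4, 1], "B": [2, 3], "C": [0, 5]}
categories = sorted(per_category)

# Stats for the whole node: collected once per node, not once per split.
node_stats = [0, 0]
for counts in per_category.values():
    node_stats = add(node_stats, counts)

# Only the left-hand subsets need to be aggregated and communicated ...
left_stats = {}
for subset in left_subsets(categories):
    total = [0, 0]
    for c in subset:
        total = add(total, per_category[c])
    left_stats[subset] = total

# ... and each right-hand side is recovered by subtraction,
# e.g. stats(B,C) = stats(A,B,C) - stats(A).
right_stats = {subset: sub(node_stats, s) for subset, s in left_stats.items()}

for subset in left_stats:
    print(subset, left_stats[subset], right_stats[subset])
```

With 3 categories this aggregates 3 subsets plus the node total instead of 6 subsets; the saving approaches a factor of two as the number of categories grows.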
We should eliminate these extra bins within the spark.ml implementation since the spark.mllib implementation will be removed before long (and will instead call into spark.ml).
Issue Links
- is related to SPARK-3383 DecisionTree aggregate size could be smaller (Resolved)