Details
-
Improvement
-
Status: Resolved
-
Major
-
Resolution: Incomplete
-
2.2.0
-
None
Description
Do not need generate all possible splits for unordered features before aggregate,
in aggregete (executor side):
1. Change `mixedBinSeqOp`, for each unordered feature, we do the same stat with ordered features. so for unordered features, we only need O(numCategories) space for this feature stat.
2. After driver side get the aggregate result, generate all possible split combinations, and compute the best split.
This will reduce decision tree aggregate size for each unordered feature from O(2^numCategories) to O(numCategories), `numCategories` is the arity of this unordered feature.
This also reduce the cpu cost in executor side. Reduce time complexity for this unordered feature from O(numPoints * 2^numCategories) to O(numPoints).
This won't increase time complexity for unordered features best split computing in driver side.
Attachments
Issue Links
- Is contained by
-
SPARK-14045 DecisionTree improvement umbrella
- Resolved
- relates to
-
SPARK-3383 DecisionTree aggregate size could be smaller
- Resolved
- links to