Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
Description
RandomForest.binsToBestSplit currently takes a large amount of time. Based on some quick profiling, I believe a big chunk of this is spent in ImpurityAggregator.getCalculator (which seems to make unnecessary Array copies) and RandomForest.calculateImpurityStats.
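To make the suspected cost concrete, here is a minimal sketch of the difference between copying a slice of a flat statistics buffer and reading it in place. It is not the Spark implementation: the object name `CopyVsInPlace`, the parameter names `allStats`, `offset`, and `statsSize`, and the choice of Gini impurity are all assumptions made for illustration.

```scala
object CopyVsInPlace {
  // Hypothetical layout: one flat buffer holding `statsSize` class counts per bin,
  // stored back to back, so a bin starts at `offset`.

  /** Copies the bin's counts into a fresh array before computing impurity. */
  def giniWithCopy(allStats: Array[Double], offset: Int, statsSize: Int): Double = {
    // Allocates a new array on every call; this is the kind of copy the ticket suspects.
    val counts = java.util.Arrays.copyOfRange(allStats, offset, offset + statsSize)
    gini(counts, 0, statsSize)
  }

  /** Reads the bin's counts directly from the shared buffer, with no allocation. */
  def giniInPlace(allStats: Array[Double], offset: Int, statsSize: Int): Double =
    gini(allStats, offset, statsSize)

  // Gini impurity: 1 - sum over classes of p_i^2, where p_i = count_i / total.
  private def gini(stats: Array[Double], offset: Int, statsSize: Int): Double = {
    var total = 0.0
    var i = 0
    while (i < statsSize) { total += stats(offset + i); i += 1 }
    if (total == 0.0) {
      0.0
    } else {
      var impurity = 1.0
      i = 0
      while (i < statsSize) {
        val p = stats(offset + i) / total
        impurity -= p * p
        i += 1
      }
      impurity
    }
  }
}
```

Both functions return the same value; the in-place variant simply avoids one allocation per call, which can matter when the call sits in an inner loop over many bins and features.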
This JIRA is for:
- Doing more profiling to confirm that unnecessary time is being spent in some of these methods.
- Optimizing the implementation.
- Profiling again to confirm the speedups.
Local profiling for large enough examples should suffice, especially since the optimizations should not need to change the amount of data communicated.
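For the "profile, optimize, profile again" loop, a rough local timing helper could look like the sketch below. The object `LocalTiming` and the method `meanNanos` are hypothetical names, not part of Spark, and wall-clock timing of this kind is only a coarse check; a sampling profiler or a benchmark harness would give more trustworthy numbers.

```scala
object LocalTiming {
  /**
   * Runs `body` `warmup` times without timing (to let the JIT compile the hot path),
   * then `runs` more times under a timer. Returns the mean nanoseconds per timed run.
   */
  def meanNanos(warmup: Int, runs: Int)(body: => Unit): Double = {
    require(runs > 0, "runs must be positive")
    var i = 0
    while (i < warmup) { body; i += 1 }
    val start = System.nanoTime()
    i = 0
    while (i < runs) { body; i += 1 }
    (System.nanoTime() - start).toDouble / runs
  }
}
```

One would call it once around the code path before the change and once after, on the same local dataset, and compare the two means.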
Issue Links
- Is contained by: SPARK-14045 DecisionTree improvement umbrella (Resolved)
- relates to: SPARK-3383 DecisionTree aggregate size could be smaller (Resolved)
- links to