[SPARK-14351] Optimize ImpurityAggregator for decision trees - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: None
Fix Version/s: None
Component/s: ML
Labels:
- bulk-closed

Description

RandomForest.binsToBestSplit currently takes a large amount of time. Based on some quick profiling, I believe a big chunk of this is spent in ImpurityAggregator.getCalculator (which seems to make unnecessary Array copies) and RandomForest.calculateImpurityStats.
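
To make the suspected cost concrete, here is a minimal Python sketch (not Spark's Scala code) contrasting a calculator that copies its slice of a shared statistics array with one that only keeps an offset into it. The class names `CopyingCalculator` and `ViewCalculator`, the flat `all_stats` layout, and the choice of Gini impurity are assumptions made for illustration; the actual ImpurityAggregator internals may differ.

```python
from typing import List


def gini(counts: List[float]) -> float:
    """Gini impurity of a list of per-class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


class CopyingCalculator:
    """Hypothetical: copies its slice of the shared stats array on creation."""

    def __init__(self, all_stats: List[float], offset: int, size: int):
        # Slicing allocates a new list for every calculator that is created.
        self.stats = all_stats[offset:offset + size]

    def calculate(self) -> float:
        return gini(self.stats)


class ViewCalculator:
    """Hypothetical: keeps a reference plus bounds, so nothing is copied."""

    def __init__(self, all_stats: List[float], offset: int, size: int):
        self.all_stats = all_stats
        self.offset = offset
        self.size = size

    def calculate(self) -> float:
        # Read the counts in place from the shared array.
        total = 0.0
        sum_sq = 0.0
        for i in range(self.offset, self.offset + self.size):
            c = self.all_stats[i]
            total += c
            sum_sq += c * c
        if total == 0:
            return 0.0
        return 1.0 - sum_sq / (total * total)
```

Both variants return the same impurity. The trade-off in this sketch is that the view-based calculator observes later writes to the shared array, while the copying one holds a snapshot, so avoiding the copy is only safe if the array is not mutated while the calculator is in use.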

This JIRA is for:

- Doing more profiling to confirm that unnecessary time is being spent in some of these methods.
- Optimizing the implementation.
- Profiling again to confirm the speedups.

Local profiling for large enough examples should suffice, especially since the optimizations should not need to change the amount of data communicated.
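
The local profiling loop could look roughly like the following Python sketch, which uses the standard-library `cProfile` and `pstats` modules to attribute call counts and time to individual functions. The functions `get_calculator` and `bins_to_best_split` are hypothetical stand-ins named after the methods mentioned above, not Spark's implementations, and `profile_call` is a helper introduced only for this example.

```python
import cProfile
import pstats


def get_calculator(all_stats, offset, size):
    # Hypothetical stand-in: copies one slice of the shared stats array.
    return all_stats[offset:offset + size]


def bins_to_best_split(all_stats, size):
    # Hypothetical stand-in: scans fixed-size chunks and keeps the largest sum.
    best = 0.0
    for offset in range(0, len(all_stats), size):
        stats = get_calculator(all_stats, offset, size)
        best = max(best, sum(stats))
    return best


def profile_call(fn, *args):
    """Run fn under cProfile and return (result, pstats.Stats)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(fn, *args)
    return result, pstats.Stats(profiler)
```

A possible usage is `result, stats = profile_call(bins_to_best_split, data, 3)` followed by `stats.sort_stats("cumulative").print_stats(5)`, run once before and once after an optimization so the per-function numbers can be compared on the same input.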

Issue Links

Is contained by: SPARK-14045 DecisionTree improvement umbrella (Resolved)

Relates to: SPARK-3383 DecisionTree aggregate size could be smaller (Resolved)

Links to: [Github] Pull Request #13959 (MechCoder)

People

Assignee: Unassigned
Reporter: Joseph K. Bradley
Votes: 0
Watchers: 8

Dates

Created: 03/Apr/16 04:33
Updated: 21/May/19 04:14
Resolved: 21/May/19 04:14