Details
-
Improvement
-
Status: Resolved
-
Minor
-
Resolution: Incomplete
-
2.2.0
-
None
Description
NaiveBayes currently takes aggreateByKey followed by a collect to calculate frequency for each feature/label. We can implement a new function 'aggregateByKeyLocally' in RDD that merges locally on each mapper before sending results to a reducer to save one stage.
We tested on NaiveBayes and see ~16% performance gain with these changes.
performance data for NB.png
Attachments
Attachments
Issue Links
- depends upon
-
SPARK-22098 Add aggregateByKeyLocally in RDD
- Resolved
- links to
1.
|
Add aggregateByKeyLocally in RDD | Resolved | Unassigned |