[SPARK-22096] use aggregateByKeyLocally to save one stage in calculating ItemFrequency in NaiveBayes - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Incomplete
Affects Version/s: 2.2.0
Fix Version/s: None
Component/s: ML
Labels:
- bulk-closed

Description

NaiveBayes currently calls aggregateByKey followed by a collect to calculate the frequency of each feature per label. We can implement a new function, 'aggregateByKeyLocally', in RDD that merges locally on each mapper before sending results to a reducer, which saves one stage.
In our tests on NaiveBayes, we saw a ~16% performance gain with these changes.
performance data for NB.png
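
The difference between the two patterns can be sketched in plain Python, with lists standing in for RDD partitions. This is only an illustration and not Spark code: the function names, the sample data, and the (count, feature sums) aggregate are hypothetical, and it assumes the proposed aggregateByKeyLocally returns the merged map straight to the caller, by analogy with the existing reduceByKeyLocally.

```python
def combine_partition(partition, zero, seq_op):
    """Fold one partition's (label, features) pairs into a per-label dict."""
    acc = {}
    for label, features in partition:
        acc[label] = seq_op(acc.get(label, zero), features)
    return acc


def merge_into(target, partial, comb_op):
    """Merge one per-label dict into another using comb_op."""
    for label, agg in partial.items():
        target[label] = comb_op(target[label], agg) if label in target else agg
    return target


def aggregate_by_key_then_collect(partitions, zero, seq_op, comb_op, num_reducers=2):
    """Current pattern: per-partition combine, shuffle to reducers, then collect."""
    partials = [combine_partition(p, zero, seq_op) for p in partitions]
    reducers = [{} for _ in range(num_reducers)]
    for partial in partials:
        for label, agg in partial.items():
            # Shuffle step: each label is routed to exactly one reducer.
            merge_into(reducers[hash(label) % num_reducers], {label: agg}, comb_op)
    result = {}
    for reducer in reducers:  # collect gathers every reducer's output
        result.update(reducer)
    return result


def aggregate_by_key_locally(partitions, zero, seq_op, comb_op):
    """Proposed pattern (assumed): per-partition combine, merged directly at the caller."""
    result = {}
    for p in partitions:
        merge_into(result, combine_partition(p, zero, seq_op), comb_op)
    return result


# Hypothetical data: two partitions of (label, feature vector) pairs.
partitions = [
    [(0, (1.0, 0.0)), (1, (0.0, 2.0)), (0, (3.0, 1.0))],
    [(1, (1.0, 1.0)), (0, (0.0, 4.0))],
]
zero = (0, (0.0, 0.0))  # (row count, per-feature sums); tuples avoid shared state


def seq_op(acc, feats):
    return (acc[0] + 1, tuple(a + f for a, f in zip(acc[1], feats)))


def comb_op(x, y):
    return (x[0] + y[0], tuple(a + b for a, b in zip(x[1], y[1])))


via_shuffle = aggregate_by_key_then_collect(partitions, zero, seq_op, comb_op)
via_local = aggregate_by_key_locally(partitions, zero, seq_op, comb_op)
```

Both paths produce the same per-label aggregates; the second one simply skips the intermediate reducer step, which is the stage the description refers to.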

Attachments

performance data for NB.png
22/Sep/17 04:34
12 kB
Vincent

Issue Links

depends upon

SPARK-22098 Add aggregateByKeyLocally in RDD

Resolved

links to

[Github] Pull Request #19318 (VinceShieh)

Sub-Tasks

Add aggregateByKeyLocally in RDD

Resolved

Unassigned

People

Assignee: Unassigned

Reporter: Vincent

Votes: 0

Watchers: 3

Dates

Created: 22/Sep/17 03:51

Updated: 24/Jun/19 08:47

Resolved: 21/May/19 04:14