Spark / SPARK-22751

Improve ML RandomForest shuffle performance


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.4.0
    • Component/s: ML
    • Labels: None

    Description

      When I trained a classifier with ML RandomForest on the news20.binary dataset, which has 19,996 training examples and 1,355,191 features, I found that the shuffle write size (51 GB) of findSplitsBySorting is very large compared with the small data size (133.52 MB). Replacing groupByKey with reduceByKey should improve shuffle performance.
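      The proposed change can be illustrated with a small sketch. This is not the actual findSplitsBySorting code; it mimics the two RDD patterns with plain Scala collections (the names and sample values are hypothetical) to show why reduceByKey shuffles less: it combines values per key on the map side, so only one partial aggregate per key per partition crosses the network, whereas groupByKey ships every raw (featureIndex, value) pair.

      ```scala
      // Hedged sketch, not Spark source: contrasts the semantics of the two
      // RDD patterns using plain Scala collections.
      object ShuffleSketch {
        type FeatureIndex = Int

        // Simulated (featureIndex, value) records, as produced before the shuffle.
        val records: Seq[(FeatureIndex, Double)] =
          Seq((0, 1.0), (0, 2.0), (1, 3.0), (0, 4.0), (1, 5.0))

        // groupByKey-style: every raw value is materialized under its key before
        // any aggregation, so in Spark the shuffle grows with the record count.
        def viaGroupByKey: Map[FeatureIndex, Double] =
          records.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

        // reduceByKey-style: values are folded into one running aggregate per
        // key as they arrive, mirroring Spark's map-side combine, so only one
        // partial sum per key per partition would be shuffled.
        def viaReduceByKey: Map[FeatureIndex, Double] =
          records.foldLeft(Map.empty[FeatureIndex, Double]) {
            case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0.0) + v)
          }
      }
      ```

      Both paths produce identical results, which is why the substitution is safe whenever the per-key aggregation is associative and commutative, as a sum or merge of split statistics is.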


          People

            Assignee: lucio35
            Reporter: lucio35
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated: 48h
                Remaining: 48h
                Logged: Not Specified