[SPARK-2612] ALS has data skew for popular product - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.0.0
Fix Version/s: 1.1.0
Component/s: MLlib
Labels:
None

Description

Rating inputs usually contain a few popular products that are rated by many users.
The groupByKey() call in updateFeatures() can introduce an extra shuffle stage that gathers all data for a popular product into a single task, because the partitioner of the grouped RDD may not be the one used by the subsequent join().
That join() then has to shuffle the aggregated product data. A single shuffle block can easily exceed 2 GB, in which case the shuffle fails as described in ~~SPARK-1476~~.
Increasing the number of blocks does not help.

IMHO, groupByKey() should use the same partitioner as the other RDD in the join(). Then groupByKey() and join() run in the same stage, and the shuffle data arriving from the many upstream tasks will not hit the 2 GB limit.
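
The proposal above can be illustrated with a minimal sketch. It is not the ALS code and not the patch from the linked pull request: the object name, the RDD names (`links`, `messages`), the sample data, and the choice of a `HashPartitioner` with 4 partitions are all assumptions made for illustration. It only shows the general idea of passing the join partner's partitioner to groupByKey() so that both sides of the join are co-partitioned.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits; required on Spark 1.x

// Hypothetical stand-alone example; names and data are illustrative only.
object CoPartitionedJoinSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("co-partitioned-join-sketch")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Assumed partitioner shared by both sides of the join.
      val partitioner = new HashPartitioner(4)

      // Stand-in for the RDD on the other side of the join,
      // already partitioned by block id.
      val links = sc.parallelize(Seq(0 -> "links-0", 1 -> "links-1"))
        .partitionBy(partitioner)

      // Stand-in for per-block messages; key 0 plays the popular product.
      val messages = sc.parallelize(Seq(0 -> 1.0, 0 -> 2.0, 0 -> 3.0, 1 -> 4.0))

      // Passing the partitioner explicitly makes the grouped RDD
      // co-partitioned with `links`, so the join needs no further
      // shuffle of the already aggregated data.
      val grouped = messages.groupByKey(partitioner)
      assert(grouped.partitioner == Some(partitioner))

      val joined = grouped.join(links)
      joined.collect().foreach { case (block, (values, link)) =>
        println(s"block=$block values=${values.mkString(",")} link=$link")
      }
    } finally {
      sc.stop()
    }
  }
}
```

With a bare `messages.groupByKey()` the grouped RDD would get a default partitioner that may differ from the one on `links`, and the join would then have to reshuffle the grouped values, which is the situation the description warns about.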

Issue Links

links to

[Github] Pull Request #1521 (renozhang)

People

Assignee: Peng Zhang
Reporter: Peng Zhang
Votes: 0
Watchers: 4

Dates

Created: 22/Jul/14 05:51
Updated: 22/Jul/14 09:41
Resolved: 22/Jul/14 09:41