[SPARK-8708] MatrixFactorizationModel.predictAll() populates single partition only - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.3.0
Fix Version/s: 1.5.0
Component/s: MLlib
Labels: None
Target Version/s: 1.5.0

Description

When using mllib.recommendation.ALS, the RDD returned by .predictAll() has all values pushed into a single partition, even with fairly high parallelism.

This degrades the performance of further processing. I can obviously run .partitionBy() to rebalance it, but that is still too costly (e.g. when running .predictAll() in a loop for thousands of products); it should be possible to handle this on the model itself, automatically.

Below is an example on a tiny sample (the same happens on a large dataset):

pyspark

>>> from operator import itemgetter
>>> from pyspark.mllib.recommendation import ALS
>>> r1 = (1, 1, 1.0)
>>> r2 = (1, 2, 2.0)
>>> r3 = (2, 1, 2.0)
>>> r4 = (2, 2, 2.0)
>>> r5 = (3, 1, 1.0)
>>> ratings = sc.parallelize([r1, r2, r3, r4, r5], 5)
>>> ratings.getNumPartitions()
5
>>> users = ratings.map(itemgetter(0)).distinct()
>>> model = ALS.trainImplicit(ratings, 1, seed=10)
>>> predictions_for_2 = model.predictAll(users.map(lambda u: (u, 2)))
>>> predictions_for_2.glom().map(len).collect()
[0, 0, 3, 0, 0]
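
The skew above is what you would expect if the result ends up hash-partitioned by product id. The report does not state the cause, so treat that as an assumption. The following is a toy, pure-Python model of hash partitioning (no Spark involved; `partition_sizes` is a hypothetical helper, not a Spark API) showing how a single distinct key fills exactly one partition, while keying by user spreads the same records out:

```python
from collections import Counter

def partition_sizes(keys, num_partitions):
    """Count records per partition when each key goes to hash(key) % num_partitions.

    Hypothetical helper for illustration only; it mimics a simple hash
    partitioner and is not how Spark is actually invoked.
    """
    counts = Counter(hash(k) % num_partitions for k in keys)
    return [counts.get(i, 0) for i in range(num_partitions)]

# Same shape as the report: three users, all paired with product 2.
users = [1, 2, 3]
product = 2
pairs = [(u, product) for u in users]

# Assumption: records are partitioned by product id, so every pair shares one key.
by_product = partition_sizes([p for _, p in pairs], 5)

# Alternative: partition by user id, which has three distinct keys here.
by_user = partition_sizes([u for u, _ in pairs], 5)

print(by_product)  # [0, 0, 3, 0, 0]
print(by_user)     # [0, 1, 1, 1, 0]
```

Under this assumption, querying one product at a time always yields a single populated partition no matter how many partitions the input had, which is why looping over thousands of products is the worst case.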

Issue Links

links to

[Github] Pull Request #7121 (viirya)

People

Assignee: L. C. Hsieh
Reporter: Antony Mayi
Votes: 0
Watchers: 4

Dates

Created: 29/Jun/15 16:38
Updated: 02/Jul/15 17:22
Resolved: 02/Jul/15 17:18