I noticed inconsistent behavior from rdd.randomSplit when the source RDD has been repartitioned, but only in YARN mode; it works fine in local mode.
val rdd = sc.parallelize(1 to 1000000)
val rdd2 = rdd.repartition(64)
val Array(train, test) = rdd2.randomSplit(Array(70, 30), 1)
train.take(10).foreach(println)
test.take(10).foreach(println)
In local mode, both take statements produce consistent results across runs, and there is no overlap between the numbers output for train and test.
However, when the same code is run on YARN, it produces different results on every run, and the train and test outputs overlap.
If I call randomSplit on rdd directly, without the repartition, it works fine even on YARN.
So it appears that the repartition is re-evaluated each time the splitting occurs, and because its shuffle does not order the data deterministically, each evaluation presents the elements in a different order.
Indeed, if I cache rdd2 (e.g. rdd2.cache()) before splitting it, the behavior becomes consistent, since the repartition is evaluated only once rather than on every action.
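A plain-Scala sketch of the mechanism being described (no Spark; the object and helper names here are hypothetical, and the Bernoulli-style slicing is only a stand-in for how randomSplit draws its complementary samples): each split keeps elements whose seeded random draw falls in a sub-range of [0, 1), so if re-evaluating the parent changes the element order, the same element can receive different draws in the two passes and land in both splits. Materializing one ordering and slicing it twice, analogous to calling cache() on rdd2, restores disjoint splits.

```scala
import scala.util.Random

object SplitSketch {
  // Keep the elements whose seeded random draw falls in [lo, hi) --
  // a stand-in for one of randomSplit's complementary samples.
  def bernoulliSlice(data: Seq[Int], seed: Long, lo: Double, hi: Double): Set[Int] = {
    val rng = new Random(seed)
    data.filter { _ => val r = rng.nextDouble(); r >= lo && r < hi }.toSet
  }

  def main(args: Array[String]): Unit = {
    val data    = (1 to 1000).toSeq
    val reorder = new Random(42)

    // Non-deterministic parent: each slice re-evaluates it and sees a
    // different ordering, so an element can land in both slices.
    val train = bernoulliSlice(reorder.shuffle(data), 1L, 0.0, 0.7)
    val test  = bernoulliSlice(reorder.shuffle(data), 1L, 0.7, 1.0)
    println(s"overlap without caching: ${(train intersect test).size}") // almost surely non-zero

    // "Cached" parent: materialize one ordering, slice it twice --
    // the slices are disjoint and together cover every element.
    val cached = reorder.shuffle(data)
    val trainC = bernoulliSlice(cached, 1L, 0.0, 0.7)
    val testC  = bernoulliSlice(cached, 1L, 0.7, 1.0)
    println(s"overlap with caching: ${(trainC intersect testC).size}") // prints 0
  }
}
```

This is only an illustration of why a fixed seed alone is not enough: the seed makes the sequence of draws repeatable, but not the assignment of draws to elements when the parent's ordering changes between evaluations.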