I noticed inconsistent behavior from rdd.randomSplit when the source RDD has been repartitioned, but only in YARN mode; it works fine in local mode.
val rdd = sc.parallelize(1 to 1000000)
val rdd2 = rdd.repartition(64)
val Array(train, test) = rdd2.randomSplit(Array(70, 30), 1)
train.take(10).foreach(println)
test.take(10).foreach(println)
In local mode, both take statements produce consistent results across runs, and there is no overlap between the numbers output for train and test.
However, when the same code is run on YARN, it produces different results on every run, and the train and test outputs overlap.
If I call randomSplit on rdd directly, without the repartition, it works fine even on YARN.
So it appears that the repartition is re-evaluated each time the splitting occurs, and because its shuffle does not order the data deterministically, each evaluation presents the elements in a different order.
Indeed, if I cache rdd2 (e.g. rdd2.cache()) before splitting it, the behavior becomes consistent, since the repartition is evaluated only once rather than on every action.
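A plain-Scala sketch of the mechanism being described (no Spark; the object and helper names here are hypothetical, and the Bernoulli-style slicing is only a stand-in for how randomSplit draws its complementary samples): each split keeps elements whose seeded random draw falls in a sub-range of [0, 1), so if re-evaluating the parent changes the element order, the same element can receive different draws in the two passes and land in both splits. Materializing one ordering and slicing it twice, analogous to calling cache() on rdd2, restores disjoint splits.

```scala
import scala.util.Random

object SplitSketch {
  // Keep the elements whose seeded random draw falls in [lo, hi) --
  // a stand-in for one of randomSplit's complementary samples.
  def bernoulliSlice(data: Seq[Int], seed: Long, lo: Double, hi: Double): Set[Int] = {
    val rng = new Random(seed)
    data.filter { _ => val r = rng.nextDouble(); r >= lo && r < hi }.toSet
  }

  def main(args: Array[String]): Unit = {
    val data    = (1 to 1000).toSeq
    val reorder = new Random(42)

    // Non-deterministic parent: each slice re-evaluates it and sees a
    // different ordering, so an element can land in both slices.
    val train = bernoulliSlice(reorder.shuffle(data), 1L, 0.0, 0.7)
    val test  = bernoulliSlice(reorder.shuffle(data), 1L, 0.7, 1.0)
    println(s"overlap without caching: ${(train intersect test).size}") // almost surely non-zero

    // "Cached" parent: materialize one ordering, slice it twice --
    // the slices are disjoint and together cover every element.
    val cached = reorder.shuffle(data)
    val trainC = bernoulliSlice(cached, 1L, 0.0, 0.7)
    val testC  = bernoulliSlice(cached, 1L, 0.7, 1.0)
    println(s"overlap with caching: ${(trainC intersect testC).size}") // prints 0
  }
}
```

This is only an illustration of why a fixed seed alone is not enough: the seed makes the sequence of draws repeatable, but not the assignment of draws to elements when the parent's ordering changes between evaluations.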