Description
With {{./bin/spark-shell --master=local-cluster[2,1,2014]}}, the following code produces overlapping rows in the two DataFrames returned by randomSplit:
{code:scala}
sqlContext.sql("drop table if exists test")
val x = sc.parallelize(1 to 210)
case class R(ID: Int)
// Write the data out as a table so that the subsequent read goes through the planner.
sqlContext.createDataFrame(x.map { R(_) }).write.format("json").saveAsTable("test")
var df = sql("select distinct ID from test")
var Array(a, b) = df.randomSplit(Array(0.333, 0.667), 1234L)
a.registerTempTable("a")
b.registerTempTable("b")
// The two splits should be disjoint, so this intersection should be empty.
val intersectDF = a.intersect(b)
intersectDF.show
{code}
The reason is that {{sql("select distinct ID from test")}} does not guarantee the ordering of rows within a partition, so the two samples drawn by randomSplit can see the rows of a partition in different orders. It would be good to add a local sort operator to make the row ordering within each partition deterministic.
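Until such an operator is added, here is a minimal sketch of a caller-side workaround, assuming {{sortWithinPartitions}} (available on DataFrame since Spark 1.6): sorting each input partition on all output columns fixes the per-partition row order, so the splits become disjoint.

{code:scala}
import org.apache.spark.sql.functions.col

// Sort every partition on all output columns so the row order within
// each partition is deterministic across re-evaluations of the plan.
val sorted = df.sortWithinPartitions(df.columns.map(col): _*)

// With a stable per-partition ordering, randomSplit with a fixed seed
// produces disjoint splits.
val Array(a2, b2) = sorted.randomSplit(Array(0.333, 0.667), 1234L)
assert(a2.intersect(b2).count() == 0)
{code}

Note that {{sortWithinPartitions}} does not shuffle: it only adds a per-partition sort, leaving the partitioning itself unchanged.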