Description
With {{./bin/spark-shell --master=local-cluster[2,1,2014]}}, the following code produces overlapping rows in the two DataFrames returned by randomSplit:
{code:scala}
sqlContext.sql("drop table if exists test")
val x = sc.parallelize(1 to 210)
case class R(ID: Int)
// Write the data out as a table so that the subsequent read goes through the planner.
sqlContext.createDataFrame(x.map { R(_) }).write.format("json").saveAsTable("test")
var df = sql("select distinct ID from test")
var Array(a, b) = df.randomSplit(Array(0.333, 0.667), 1234L)
a.registerTempTable("a")
b.registerTempTable("b")
// The two splits should be disjoint, so this intersection should be empty.
val intersectDF = a.intersect(b)
intersectDF.show
{code}
The reason is that {{sql("select distinct ID from test")}} does not guarantee the ordering of rows within a partition, so the two samples drawn by randomSplit can see the rows of a partition in different orders. It would be good to add a local sort operator to make the row ordering within each partition deterministic.
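Until such an operator is added, here is a minimal sketch of a caller-side workaround, assuming {{sortWithinPartitions}} (available on DataFrame since Spark 1.6): sorting each input partition on all output columns fixes the per-partition row order, so the splits become disjoint.

{code:scala}
import org.apache.spark.sql.functions.col

// Sort every partition on all output columns so the row order within
// each partition is deterministic across re-evaluations of the plan.
val sorted = df.sortWithinPartitions(df.columns.map(col): _*)

// With a stable per-partition ordering, randomSplit with a fixed seed
// produces disjoint splits.
val Array(a2, b2) = sorted.randomSplit(Array(0.333, 0.667), 1234L)
assert(a2.intersect(b2).count() == 0)
{code}

Note that {{sortWithinPartitions}} does not shuffle: it only adds a per-partition sort, leaving the partitioning itself unchanged.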