  1. Spark
  2. SPARK-12662

Add a local sort operator to DataFrame used by randomSplit


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.1, 2.0.0
    • Component/s: Documentation, SQL
    • Labels: None

    Description

      With ./bin/spark-shell --master=local-cluster[2,1,2014], the following code returns overlapping rows in the two DataFrames produced by randomSplit.

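      // Start from a clean slate so the example can be re-run.
      sqlContext.sql("drop table if exists test")
      val x = sc.parallelize(1 to 210)
      case class R(ID : Int)
      // Save under the same table name that is dropped above and queried below.
      sqlContext.createDataFrame(x.map {R(_)}).write.format("json").saveAsTable("test")
      val df = sql("select distinct ID from test")
      // Split with a fixed seed; the two results should be disjoint.
      val Array(a, b) = df.randomSplit(Array(0.333, 0.667), 1234L)
      a.registerTempTable("a")
      b.registerTempTable("b")
      // Any row shown here appears in both splits, which is the bug.
      val intersectDF = a.intersect(b)
      intersectDF.show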
      

      The reason is that sql("select distinct ID from test") does not guarantee the ordering of rows within a partition, so the row order can differ each time the DataFrame is evaluated. It would be good to add a local sort operator to make row ordering within a partition deterministic.
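
The effect of row ordering can be illustrated without Spark. The following is a minimal sketch, not Spark's actual sampler: it assumes each split keeps the rows whose positional random draw (the i-th row consumes the i-th draw from a generator seeded identically for every split) falls in that split's weight range. The names RandomSplitSketch and split are hypothetical and exist only for this illustration.

```scala
import scala.util.Random

object RandomSplitSketch {
  // Simplified model (an assumption, not Spark's implementation):
  // the i-th row of the partition consumes the i-th draw from a
  // generator seeded with `seed`, and is kept when the draw is in [lo, hi).
  def split(partition: Seq[Int], lo: Double, hi: Double, seed: Long): Seq[Int] = {
    val rng = new Random(seed)
    partition.toList.filter { _ =>
      val draw = rng.nextDouble()
      draw >= lo && draw < hi
    }
  }

  def main(args: Array[String]): Unit = {
    val seed = 1234L
    // The same rows, seen in two different orders, as can happen when
    // "select distinct" is recomputed once per split.
    val firstEval  = (1 to 210).toList
    val secondEval = firstEval.reverse

    // Without a local sort, a row can receive a different draw in each
    // evaluation, so the splits may overlap or drop rows.
    val a = split(firstEval, 0.0, 0.333, seed)
    val b = split(secondEval, 0.333, 1.0, seed)
    println(s"overlap without local sort: ${a.intersect(b).size} rows")

    // With a local sort, both evaluations see the same order, every row
    // gets the same draw each time, and the splits partition the input.
    val aSorted = split(firstEval.sorted, 0.0, 0.333, seed)
    val bSorted = split(secondEval.sorted, 0.333, 1.0, seed)
    assert(aSorted.intersect(bSorted).isEmpty)
    assert((aSorted ++ bSorted).sorted == firstEval.sorted)
    println(s"overlap with local sort: ${aSorted.intersect(bSorted).size} rows")
  }
}
```

In this model the sorted splits are disjoint and together cover every row, because each row is paired with exactly one draw regardless of how the partition was produced. In user code on an affected version, sorting within partitions before splitting (for example with DataFrame.sortWithinPartitions, available from 1.6) or persisting the DataFrame may serve as a workaround, although neither is described in this issue.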

          People

            Assignee: Sameer Agarwal (sameerag)
            Reporter: Yin Huai (yhuai)