Spark / SPARK-12662

Add a local sort operator to DataFrame used by randomSplit

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.1, 2.0.0
    • Component/s: Documentation, SQL
    • Labels:
      None

      Description

      With ./bin/spark-shell --master=local-cluster[2,1,2014], the following code produces overlapping rows in the two DataFrames returned by randomSplit.

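      // Start from a clean slate so saveAsTable does not fail on an existing table.
      sqlContext.sql("drop table if exists test")
      val x = sc.parallelize(1 to 210)
      case class R(ID : Int)
      // Save under the same table name ("test") that the query below reads.
      sqlContext.createDataFrame(x.map {R(_)}).write.format("json").saveAsTable("test")
      val df = sqlContext.sql("select distinct ID from test")
      val Array(a, b) = df.randomSplit(Array(0.333, 0.667), 1234L)
      a.registerTempTable("a")
      b.registerTempTable("b")
      // The two splits should be disjoint, so this should be empty; with the bug it is not.
      val intersectDF = a.intersect(b)
      intersectDF.show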
      

      The reason is that {{sql("select distinct ID from test")}} does not guarantee the ordering of rows within a partition, so the two splits can see the same rows in different orders when each one is evaluated. It would be good to add a local sort operator to make row ordering within a partition deterministic.
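
      To illustrate why row order matters here, below is a minimal sketch in plain Scala (no Spark dependency). It is not Spark's implementation: it assumes a sampler that draws one seeded random number per row position and keeps the row if the draw falls in the split's range, which is the general idea behind a seeded range-based split. The names LocalSortSplitSketch, sampleRange, split, and flippingReader are hypothetical, and the "partition" is just an in-memory Vector whose order flips on every read to mimic an unordered upstream operator.

```scala
import scala.util.Random

object LocalSortSplitSketch {
  // Keep the rows whose positional random draw falls in [lo, hi).
  // The draw depends only on the seed and the row's position, not on its value.
  def sampleRange(partition: Vector[Int], lo: Double, hi: Double, seed: Long): Vector[Int] = {
    val rng = new Random(seed)
    partition.filter { _ =>
      val x = rng.nextDouble()
      x >= lo && x < hi
    }
  }

  // Hypothetical stand-in for a two-way split of one partition.
  // Each split re-reads the partition, as separate evaluations would.
  def split(read: () => Vector[Int], seed: Long, localSort: Boolean): (Vector[Int], Vector[Int]) = {
    def rows(): Vector[Int] = if (localSort) read().sorted else read()
    (sampleRange(rows(), 0.0, 0.333, seed), sampleRange(rows(), 0.333, 1.0, seed))
  }

  // Reader whose row order flips on every call (assumed model of an unordered upstream).
  def flippingReader(data: Vector[Int]): () => Vector[Int] = {
    var calls = 0
    () => { calls += 1; if (calls % 2 == 1) data else data.reverse }
  }

  def main(args: Array[String]): Unit = {
    val data = (1 to 210).toVector
    val (a1, b1) = split(flippingReader(data), 1234L, localSort = false)
    val (a2, b2) = split(flippingReader(data), 1234L, localSort = true)
    println(s"without local sort: ${a1.intersect(b1).size} overlapping rows")
    println(s"with local sort:    ${a2.intersect(b2).size} overlapping rows")
  }
}
```

      With the local sort, both splits see the rows in the same order, so the position-based draws line up and the splits are disjoint by construction. Without it, the same position can hold different rows in the two reads, so a row may be selected by both splits or by neither.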

            People

            • Assignee:
              sameerag Sameer Agarwal
              Reporter:
              yhuai Yin Huai
            • Votes:
              0
              Watchers:
              10