SPARK-13333

DataFrame filter + randn + unionAll has bad interaction

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 1.4.2, 1.6.1, 2.0.0
    • Fix Version/s: None
    • Component/s: SQL

    Description

      Buggy workflow

      • Create a DataFrame df0
      • Filter df0
      • Add a randn column
      • Create a copy of the DataFrame
      • unionAll the two DataFrames

      This workflow gives an unexpected result: randn produces the same values on the original DataFrame and on the copy before unionAll, but after unionAll the rows coming from the two DataFrames no longer match. Removing the filter fixes the problem.

      The bug can be reproduced on master:

      import org.apache.spark.sql.functions.{col, randn}
      
      val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
      
      // Removing the following filter() call makes this give the expected result.
      val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
      println("DF1")
      df1.show()
      
      val df2 = df1.select("id", "b")
      println("DF2")
      df2.show()  // same as df1.show(), as expected
      
      val df3 = df1.unionAll(df2)
      println("DF3")
      df3.show()  // NOT two copies of df1, which is unexpected
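
      The visual comparison above can also be expressed as an assertion. The following is only a sketch: it assumes a Spark 1.x spark-shell session with sqlContext in scope, and the names single and unioned are introduced here for illustration. It encodes the expectation stated in this report (the union should contain the values of df1 twice) and makes no claim about the underlying cause.

      ```scala
      import org.apache.spark.sql.functions.{col, randn}

      val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
      val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
      val df3 = df1.unionAll(df1.select("id", "b"))

      // Collect the random column "b" from df1 alone and from the union.
      val single = df1.collect().map(_.getDouble(1))
      val unioned = df3.collect().map(_.getDouble(1))

      // Expectation from this report: the union holds every value of df1 twice.
      // On affected versions this assertion is expected to fail.
      assert(
        unioned.sorted.sameElements((single ++ single).sorted),
        "unionAll result is not two copies of df1")
      ```

      Dropping the filter() call from df1 should, per the description above, make the assertion pass.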
      

          People

            Assignee: Unassigned
            Reporter: Joseph K. Bradley (josephkb)
            Votes: 0
            Watchers: 5
