Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-20761

Union uses column order rather than schema

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Minor
    • Resolution: Duplicate
    • 2.1.1
    • None
    • SQL
    • None

    Description

      I believe there is an issue when using union to combine two dataframes when the order of columns differ between the left and right side of the union:

      import org.apache.spark.sql.{Row, SparkSession}
      import org.apache.spark.sql.types.{BooleanType, StringType, StructField, StructType}
      
      val schema = StructType(Seq(
        StructField("id", StringType, false),
        StructField("flag_one", BooleanType, false),
        StructField("flag_two", BooleanType, false),
        StructField("flag_three", BooleanType, false)
      ))
      
      val rowRdd = spark.sparkContext.parallelize(Seq(
        Row("1", true, false, false),
        Row("2", false, true, false),
        Row("3", false, false, true)
      ))
      
      spark.createDataFrame(rowRdd, schema).createOrReplaceTempView("temp_flags")
      
      val emptyData = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
      
      //Select columns out of order with respect to the emptyData schema
      val data = emptyData.union(spark.sql("select id, flag_two, flag_three, flag_one from temp_flags"))
      

      Selecting the data from the "temp_flags" table results in:

      spark.sql("select * from temp_flags").show
      +---+--------+--------+----------+
      | id|flag_one|flag_two|flag_three|
      +---+--------+--------+----------+
      |  1|    true|   false|     false|
      |  2|   false|    true|     false|
      |  3|   false|   false|      true|
      +---+--------+--------+----------+
      

      Which is the data we'd expect but when inspecting "data" we get:

      data.show()
      +---+--------+--------+----------+
      | id|flag_one|flag_two|flag_three|
      +---+--------+--------+----------+
      |  1|   false|   false|      true|
      |  2|    true|   false|     false|
      |  3|   false|    true|     false|
      +---+--------+--------+----------+
      

      Having a non-empty dataframe on the left side of the union doesn't seem to make a difference either:

      spark.sql("select * from temp_flags").union(spark.sql("select id, flag_two, flag_three, flag_one from temp_flags")).show
      +---+--------+--------+----------+
      | id|flag_one|flag_two|flag_three|
      +---+--------+--------+----------+
      |  1|    true|   false|     false|
      |  2|   false|    true|     false|
      |  3|   false|   false|      true|
      |  1|   false|   false|      true|
      |  2|    true|   false|     false|
      |  3|   false|    true|     false|
      +---+--------+--------+----------+
      

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              njeirath Nakul Jeirath
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: