Details
-
Bug
-
Status: Resolved
-
Minor
-
Resolution: Duplicate
-
2.1.1
-
None
-
None
Description
I believe there is an issue when using union to combine two dataframes when the order of columns differ between the left and right side of the union:
import org.apache.spark.sql.{Row, SparkSession} import org.apache.spark.sql.types.{BooleanType, StringType, StructField, StructType} val schema = StructType(Seq( StructField("id", StringType, false), StructField("flag_one", BooleanType, false), StructField("flag_two", BooleanType, false), StructField("flag_three", BooleanType, false) )) val rowRdd = spark.sparkContext.parallelize(Seq( Row("1", true, false, false), Row("2", false, true, false), Row("3", false, false, true) )) spark.createDataFrame(rowRdd, schema).createOrReplaceTempView("temp_flags") val emptyData = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema) //Select columns out of order with respect to the emptyData schema val data = emptyData.union(spark.sql("select id, flag_two, flag_three, flag_one from temp_flags"))
Selecting the data from the "temp_flags" table results in:
spark.sql("select * from temp_flags").show +---+--------+--------+----------+ | id|flag_one|flag_two|flag_three| +---+--------+--------+----------+ | 1| true| false| false| | 2| false| true| false| | 3| false| false| true| +---+--------+--------+----------+
Which is the data we'd expect but when inspecting "data" we get:
data.show() +---+--------+--------+----------+ | id|flag_one|flag_two|flag_three| +---+--------+--------+----------+ | 1| false| false| true| | 2| true| false| false| | 3| false| true| false| +---+--------+--------+----------+
Having a non-empty dataframe on the left side of the union doesn't seem to make a difference either:
spark.sql("select * from temp_flags").union(spark.sql("select id, flag_two, flag_three, flag_one from temp_flags")).show +---+--------+--------+----------+ | id|flag_one|flag_two|flag_three| +---+--------+--------+----------+ | 1| true| false| false| | 2| false| true| false| | 3| false| false| true| | 1| false| false| true| | 2| true| false| false| | 3| false| true| false| +---+--------+--------+----------+
Attachments
Issue Links
- duplicates
-
SPARK-15918 unionAll returns wrong result when two dataframes has schema in different order
- Resolved
- relates to
-
SPARK-22335 Union for DataSet uses column order instead of types for union
- Resolved