Description
val df1 = sqlContext
  .range(1)
  .selectExpr("NAMED_STRUCT('a', id, 'd', id + 3) AS s")
  .coalesce(1)

val df2 = sqlContext
  .range(1, 2)
  .selectExpr("NAMED_STRUCT('a', id, 'b', id + 1, 'c', id + 2, 'd', id + 3) AS s")
  .coalesce(1)

df1.write.mode("overwrite").parquet("/home/yin/sc_11_minimal/p=1")
df2.write.mode("overwrite").parquet("/home/yin/sc_11_minimal/p=2")
sqlContext
  .read
  .option("mergeSchema", "true")
  .parquet("/home/yin/sc_11_minimal/")
  .selectExpr("s.a", "s.b", "s.c", "s.d", "p")
  .show()

+---+---+----+----+---+
|  a|  b|   c|   d|  p|
+---+---+----+----+---+
|  0|  3|null|null|  1|
|  1|  2|   3|   4|  2|
+---+---+----+----+---+
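For reference, the output one would expect from this reproduction (the value of d kept in column d, with b and c null for the p=1 row) would look like this:

+---+----+----+---+---+
|  a|   b|   c|  d|  p|
+---+----+----+---+---+
|  0|null|null|  3|  1|
|  1|   2|   3|  4|  2|
+---+----+----+---+---+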
Looks like the problem is at https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala#L185-L204: we pad the row when the global schema has more struct fields than the local parquet file's schema. However, when we read a field from parquet, we still assign values using the parquet file's local schema ordinals, so the value of d ends up in the wrong slot (column b).
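A minimal, self-contained Scala sketch of the ordinal mismatch (this is not the actual CatalystRowConverter code; the object name is made up and the field names and values are taken from the reproduction above):

object SlotMismatchSketch {
  def main(args: Array[String]): Unit = {
    // Global (merged) struct schema: s has four fields.
    val globalFields = Seq("a", "b", "c", "d")
    // Local schema of the parquet file written from df1: only two fields.
    val localFields = Seq("a", "d")
    // Values read from that file for id = 0: a = 0, d = 3.
    val localValues = Seq[Any](0L, 3L)

    // Buggy behavior: the row is padded to the global width, but values are
    // written by the *local* ordinal, so d (local ordinal 1) lands in slot 1,
    // which is b in the global schema.
    val buggyRow = new Array[Any](globalFields.length)
    localValues.zipWithIndex.foreach { case (v, localOrdinal) =>
      buggyRow(localOrdinal) = v
    }
    println(globalFields.zip(buggyRow)) // List((a,0), (b,3), (c,null), (d,null))

    // Intended behavior: map each local field to its ordinal in the global schema.
    val fixedRow = new Array[Any](globalFields.length)
    localValues.zip(localFields).foreach { case (v, name) =>
      fixedRow(globalFields.indexOf(name)) = v
    }
    println(globalFields.zip(fixedRow)) // List((a,0), (b,null), (c,null), (d,3))
  }
}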
I tried master. It looks like this issue is resolved by https://github.com/apache/spark/pull/8509. We need to decide whether we want to backport that to branch-1.5.
Issue Links
- duplicates SPARK-10301: For struct type, if parquet's global schema has less fields than a file's schema, data reading will fail (Resolved)