Description
val df1 = sqlContext
  .range(1)
  .selectExpr("NAMED_STRUCT('a', id, 'd', id + 3) AS s")
  .coalesce(1)

val df2 = sqlContext
  .range(1, 2)
  .selectExpr("NAMED_STRUCT('a', id, 'b', id + 1, 'c', id + 2, 'd', id + 3) AS s")
  .coalesce(1)

df1.write.mode("overwrite").parquet("/home/yin/sc_11_minimal/p=1")
df2.write.mode("overwrite").parquet("/home/yin/sc_11_minimal/p=2")
sqlContext
  .read
  .option("mergeSchema", "true")
  .parquet("/home/yin/sc_11_minimal/")
  .selectExpr("s.a", "s.b", "s.c", "s.d", "p")
  .show()

+---+---+----+----+---+
|  a|  b|   c|   d|  p|
+---+---+----+----+---+
|  0|  3|null|null|  1|
|  1|  2|   3|   4|  2|
+---+---+----+----+---+
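For reference, the output one would expect from this reproduction (the value of d kept in column d, with b and c null for the p=1 row) would look like this:

+---+----+----+---+---+
|  a|   b|   c|  d|  p|
+---+----+----+---+---+
|  0|null|null|  3|  1|
|  1|   2|   3|  4|  2|
+---+----+----+---+---+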
Looks like the problem is at https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala#L185-L204: we pad the row when the global schema has more struct fields than the local parquet file's schema. However, when we read a field from parquet, we still assign values using the parquet file's local schema ordinals, so the value of d ends up in the wrong slot (column b).
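A minimal, self-contained Scala sketch of the ordinal mismatch (this is not the actual CatalystRowConverter code; the object name is made up and the field names and values are taken from the reproduction above):

object SlotMismatchSketch {
  def main(args: Array[String]): Unit = {
    // Global (merged) struct schema: s has four fields.
    val globalFields = Seq("a", "b", "c", "d")
    // Local schema of the parquet file written from df1: only two fields.
    val localFields = Seq("a", "d")
    // Values read from that file for id = 0: a = 0, d = 3.
    val localValues = Seq[Any](0L, 3L)

    // Buggy behavior: the row is padded to the global width, but values are
    // written by the *local* ordinal, so d (local ordinal 1) lands in slot 1,
    // which is b in the global schema.
    val buggyRow = new Array[Any](globalFields.length)
    localValues.zipWithIndex.foreach { case (v, localOrdinal) =>
      buggyRow(localOrdinal) = v
    }
    println(globalFields.zip(buggyRow)) // List((a,0), (b,3), (c,null), (d,null))

    // Intended behavior: map each local field to its ordinal in the global schema.
    val fixedRow = new Array[Any](globalFields.length)
    localValues.zip(localFields).foreach { case (v, name) =>
      fixedRow(globalFields.indexOf(name)) = v
    }
    println(globalFields.zip(fixedRow)) // List((a,0), (b,null), (c,null), (d,3))
  }
}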
I tried master. It looks like this issue is resolved by https://github.com/apache/spark/pull/8509. We need to decide whether we want to backport that to branch-1.5.
Issue Links
- duplicates SPARK-10301: For struct type, if parquet's global schema has less fields than a file's schema, data reading will fail (Resolved)