Spark / SPARK-10428

Struct fields read from parquet are mis-aligned


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.5.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:
      None

      Description

      val df1 = sqlContext
        .range(1)
        .selectExpr("NAMED_STRUCT('a', id, 'd', id + 3) AS s")
        .coalesce(1)

      val df2 = sqlContext
        .range(1, 2)
        .selectExpr("NAMED_STRUCT('a', id, 'b', id + 1, 'c', id + 2, 'd', id + 3) AS s")
        .coalesce(1)

      df1.write.mode("overwrite").parquet("/home/yin/sc_11_minimal/p=1")
      df2.write.mode("overwrite").parquet("/home/yin/sc_11_minimal/p=2")

      sqlContext.read.option("mergeSchema", "true").parquet("/home/yin/sc_11_minimal/").selectExpr("s.a", "s.b", "s.c", "s.d", "p").show()
      
      +---+---+----+----+---+
      |  a|  b|   c|   d|  p|
      +---+---+----+----+---+
      |  0|  3|null|null|  1|
      |  1|  2|   3|   4|  2|
      +---+---+----+----+---+
      

      Looks like the problem is at https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala#L185-L204: we pad the row when the global schema has more struct fields than the local Parquet file's schema, but when we read a field from Parquet we still index by the local schema's field position, so the value of d gets written to the wrong slot.
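      To make the mis-alignment concrete, here is a minimal standalone sketch (not Spark's actual converter code; the object and method names are made up for illustration). It contrasts positional padding, which reproduces the bug for a local row with fields (a, d) under the merged schema (a, b, c, d), with name-based alignment, which places each value in its correct slot:

      ```scala
      // Hypothetical model of merging a local file schema into a wider
      // global (merged) schema. None stands in for a null slot.
      object StructAlignment {
        // Buggy behavior: copy local values into the leading slots by
        // position, then pad the tail. The value of 'd' lands under 'b'.
        def alignByPosition(globalFields: Seq[String],
                            local: Seq[(String, Any)]): Seq[Option[Any]] =
          local.map { case (_, v) => Some(v) } ++
            Seq.fill(globalFields.length - local.length)(None)

        // Correct behavior: look each global field up by name in the
        // local row; fields absent from the local schema become None.
        def alignByName(globalFields: Seq[String],
                        local: Seq[(String, Any)]): Seq[Option[Any]] = {
          val byName = local.toMap
          globalFields.map(byName.get)
        }

        def main(args: Array[String]): Unit = {
          val global   = Seq("a", "b", "c", "d")
          val localRow = Seq("a" -> 0L, "d" -> 3L) // df1's struct for id = 0

          // Positional padding: 3 (the value of d) appears in b's slot,
          // matching the wrong output in the table above.
          println(alignByPosition(global, localRow))
          // Name-based lookup: b and c are null, d keeps its value.
          println(alignByName(global, localRow))
        }
      }
      ```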

      I tried master. Looks like this issue is resolved by https://github.com/apache/spark/pull/8509. We need to decide if we want to back-port that to branch-1.5.

              People

              • Assignee: Unassigned
              • Reporter: yhuai Yin Huai
              • Votes: 0
              • Watchers: 2
