Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-10428

Struct fields read from parquet are mis-aligned

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • 1.5.0
    • None
    • SQL
    • None

    Description

      val df1 = sqlContext
              .range(1)
              .selectExpr("NAMED_STRUCT('a', id, 'd', id + 3) AS s")
              .coalesce(1)
      
      val df2 = sqlContext
        .range(1, 2)
        .selectExpr("NAMED_STRUCT('a', id, 'b', id + 1, 'c', id + 2, 'd', id + 3) AS s")
        .coalesce(1)
      
      df1.write.mode("overwrite").parquet("/home/yin/sc_11_minimal/p=1")
      df2.write.mode("overwrite").parquet("/home/yin/sc_11_minimal/p=2")
      
      sqlContext.read.option("mergeSchema", "true").parquet("/home/yin/sc_11_minimal/").selectExpr("s.a", "s.b", "s.c", "s.d", “p").show
      
      +---+---+----+----+---+
      |  a|  b|   c|   d|  p|
      +---+---+----+----+---+
      |  0|  3|null|null|  1|
      |  1|  2|   3|   4|  2|
      +---+---+----+----+---+
      

      Looks like the problem is at https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala#L185-L204, we do padding when global schema has more struct fields than local parquet file's schema. However, when we read field from parquet, we still use parquet's local schema and then we put the value of d to the wrong slot.

      I tried master. Looks like this issue is resolved by https://github.com/apache/spark/pull/8509. We need to decide if we want to back port that to branch 1.5.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              yhuai Yin Huai
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: