SPARK-16344

Array of struct with a single field named "element" can't be decoded from Parquet files written by Spark 1.6+


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.0, 1.6.1, 1.6.2, 2.0.0
    • Fix Version/s: 2.1.0
    • Component/s: SQL
    • Labels: None

    Description

      This is a weird corner case. Users may hit this issue if they have a schema that

      1. has an array field whose element type is a struct, and
      2. the struct has one and only one field, and
      3. that field is named "element".

      The following Spark shell snippet for Spark 1.6 reproduces this bug:

      case class A(element: Long)
      case class B(f: Array[A])
      
      val path = "/tmp/silly.parquet"
      Seq(B(Array(A(42)))).toDF("f0").write.mode("overwrite").parquet(path)
      
      val df = sqlContext.read.parquet(path)
      df.printSchema()
      // root
      //  |-- f0: array (nullable = true)
      //  |    |-- element: struct (containsNull = true)
      //  |    |    |-- element: long (nullable = true)
      
      df.show()
      

      Exception thrown:

      org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/tmp/silly.parquet/part-r-00007-e06db7b0-5181-4a14-9fee-5bb452e883a0.gz.parquet
              at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
              at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
              at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:194)
              at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
              at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
              at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
              at scala.collection.Iterator$class.foreach(Iterator.scala:727)
              at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
              at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
              at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
              at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
              at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
              at scala.collection.AbstractIterator.to(Iterator.scala:1157)
              at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
              at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
              at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
              at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
              at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
              at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
              at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
              at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
              at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
              at org.apache.spark.scheduler.Task.run(Task.scala:89)
              at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:745)
      Caused by: java.lang.ClassCastException: Expected instance of group converter but got "org.apache.spark.sql.execution.datasources.parquet.CatalystPrimitiveConverter"
              at org.apache.parquet.io.api.Converter.asGroupConverter(Converter.java:37)
              at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:266)
              at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:134)
              at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:99)
              at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
              at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:99)
              at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
              at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
              ... 26 more
      

      Spark 2.0.0-SNAPSHOT and Spark master are also affected by this issue. To reproduce it on those versions, replace sqlContext in the above snippet with spark, as sketched below.
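
      A minimal sketch of the same repro adapted for a Spark 2.x shell, assuming the standard spark-shell environment where spark is the SparkSession and the toDF implicits are already in scope:

      case class A(element: Long)
      case class B(f: Array[A])

      val path = "/tmp/silly.parquet"
      Seq(B(Array(A(42)))).toDF("f0").write.mode("overwrite").parquet(path)

      // `spark` replaces `sqlContext` when reading the file back
      val df = spark.read.parquet(path)
      df.show()  // fails with the same ParquetDecodingException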

      The root cause is related to the backwards-compatibility rules for LIST types defined in the parquet-format spec.

      The Spark SQL schema shown above

      root
       |-- f0: array (nullable = true)
       |    |-- element: struct (containsNull = true)
       |    |    |-- element: long (nullable = true)
      

      is equivalent to the following SQL type:

      STRUCT<
        f0: ARRAY<
          STRUCT<element: BIGINT>
        >
      >
      

      According to the parquet-format spec, the standard layout of a LIST-like structure is a 3-level layout:

      <list-repetition> group <name> (LIST) {
        repeated group list {
          <element-repetition> <element-type> element;
        }
      }
      

      Thus, the standard representation of the aforementioned SQL type should be:

      message root {
        optional group f0 (LIST) {
          repeated group list {
            optional group element {    (1)
              optional int64 element;   (2)
            }
          }
        }
      }
      

      Note that the two "element" fields are different:

      • The group field "element" at (1) is the "container" of the list element type; this wrapper level is defined as part of the parquet-format spec.
      • The int64 field "element" at (2) corresponds to the element field of the case class A defined above.

      However, for historical reasons, various existing systems do not conform to the parquet-format spec and write LIST structures using non-standard layouts. For example, parquet-avro and parquet-thrift use 2-level layouts like the following:

      // parquet-avro style
      <list-repetition> group <name> (LIST) {
        repeated <element-type> array;
      }
      
      // parquet-thrift style
      <list-repetition> group <name> (LIST) {
        repeated <element-type> <name>_tuple;
      }
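
      To make the contrast concrete, here is a hypothetical sketch of how this ticket's ARRAY<STRUCT<element: BIGINT>> column could be laid out in the parquet-avro 2-level style (an illustration of the rule above, not schema output captured from an actual parquet-avro writer):

      // hypothetical parquet-avro style layout of column f0
      optional group f0 (LIST) {
        repeated group array {
          optional int64 element;
        }
      }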
      

      To remain backwards-compatible, the parquet-format spec also defines a set of rules for recognizing these legacy patterns.

      Unfortunately, these backwards-compatibility rules make the Parquet schema shown above ambiguous:

      message root {
        optional group f0 (LIST) {
          repeated group list {
            optional group element {
              optional int64 element;
            }
          }
        }
      }
      

      When interpreted using the standard 3-level layout, it is the expected type:

      STRUCT<
        f0: ARRAY<
          STRUCT<element: BIGINT>
        >
      >
      

      When interpreted using the legacy 2-level layout, it is the unexpected type:

      // When interpreted as legacy 2-level layout
      STRUCT<
        f0: ARRAY<
          STRUCT<element: STRUCT<element: BIGINT>>
        >
      >
      

      This is because the nested struct field happens to be named "element", which is also the dedicated name of the element type "container" group in the standard 3-level layout, and this name collision leads to the ambiguity.

      Currently, Spark 1.6.x, 2.0.0-SNAPSHOT, and master all choose the second interpretation. We can fix this issue by giving the standard 3-level layout a higher priority when matching schema patterns.
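
      To illustrate the kind of pattern matching involved, here is a simplified, self-contained Scala sketch of such a guessing heuristic. The types, names, and rule ordering below are hypothetical, not Spark's actual converter code; the final name-matching rule models one plausible way the wrong 2-level guess can arise, consistent with the behavior described above.

      // Simplified model of the repeated field nested inside a Parquet LIST group.
      sealed trait RepeatedField { def name: String }
      case class RepeatedPrimitive(name: String) extends RepeatedField
      case class RepeatedGroup(name: String, fieldNames: List[String]) extends RepeatedField

      // Returns true when the repeated field is taken to BE the list element
      // (legacy 2-level layout) rather than the standard 3-level wrapper group.
      def repeatedFieldIsElement(
          listFieldName: String,
          repeated: RepeatedField,
          expectedElementFieldNames: List[String]): Boolean = repeated match {
        case _: RepeatedPrimitive => true                          // repeated primitive: 2-level
        case g: RepeatedGroup if g.fieldNames.size > 1 => true     // multi-field group: 2-level
        case g: RepeatedGroup if g.name == "array" => true         // parquet-avro style: 2-level
        case g: RepeatedGroup
            if g.name == s"${listFieldName}_tuple" => true         // parquet-thrift style: 2-level
        case g: RepeatedGroup
            if g.fieldNames == expectedElementFieldNames => true   // ambiguous name match: guess 2-level
        case _ => false                                            // otherwise: standard 3-level wrapper
      }

      // For the schema in this ticket, the repeated group "list" holds a single
      // field named "element", and the expected element STRUCT<element: BIGINT>
      // also has a single field named "element", so the name-match rule fires
      // and the repeated group is wrongly treated as the element itself.
      repeatedFieldIsElement(
        listFieldName = "f0",
        repeated = RepeatedGroup("list", List("element")),
        expectedElementFieldNames = List("element"))               // true: the wrong, 2-level guess

      Checking for the standard shape first, i.e. a repeated group named "list" whose single field is named "element", before applying any name-matching fallback resolves the ambiguity in favor of the 3-level interpretation.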


People

  Assignee: Cheng Lian
  Reporter: Cheng Lian
  Votes: 0
  Watchers: 5
