Spark / SPARK-10434

Parquet compatibility with 1.4 is broken when writing arrays that may contain nulls

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.5.1, 1.6.0
    • Component/s: SQL
    • Labels:
      None

      Description

      When writing arrays that may contain nulls, for example:

      StructType(Seq(
        StructField(
          "f",
          ArrayType(IntegerType, containsNull = true),
          nullable = false)))
      

      Spark 1.4 uses the following schema:

      message m {
        required group f (LIST) {
          repeated group bag {
            optional int32 array;
          }
        }
      }
      

      This behavior is a hybrid of parquet-avro and parquet-hive: the 3-level structure and repeated group name "bag" are borrowed from parquet-hive, while the innermost element field name "array" is borrowed from parquet-avro.

      However, in Spark 1.5, I failed to notice the latter fact and used a schema in pure parquet-hive flavor, namely:

      message m {
        required group f (LIST) {
          repeated group bag {
            optional int32 array_element;
          }
        }
      }
      

      One direct consequence is that Parquet files containing such array fields written by Spark 1.5 can't be read by Spark 1.4 (all array elements become null).
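A toy model of why a renamed innermost field can surface as nulls. It assumes (the issue does not spell this out) that the reader resolves each column by its full path and treats a path missing from the file as null for every row. `NameMismatchSketch`, `readColumn`, and the map-based "file" are hypothetical illustrations, not Spark or Parquet APIs.

```scala
object NameMismatchSketch {
  type ColumnPath = Seq[String]
  type ParquetFile = Map[ColumnPath, Seq[Option[Int]]]

  // Assumption: a column path absent from the file reads back as null for every row.
  def readColumn(file: ParquetFile, path: ColumnPath, rowCount: Int): Seq[Option[Int]] =
    file.getOrElse(path, Seq.fill(rowCount)(Option.empty[Int]))

  def main(args: Array[String]): Unit = {
    // Layout written by Spark 1.5.0: innermost field is "array_element".
    val writtenBy15: ParquetFile =
      Map(Seq("f", "bag", "array_element") -> Seq(Some(1), None, Some(3)))

    // Path requested by a Spark 1.4 reader: innermost field is "array".
    val requestedBy14: ColumnPath = Seq("f", "bag", "array")

    // The requested path is not in the file, so every element comes back as None.
    println(readColumn(writtenBy15, requestedBy14, rowCount = 3))
  }
}
```

In this model the stored values are intact; only the lookup key differs, which is why renaming the field back is enough to restore compatibility.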

      To fix this issue, the name of the innermost field should be changed back to "array". Note that this fix doesn't affect interoperability with Hive (saving Parquet files using saveAsTable() and then reading them using Hive).
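The intended effect of the fix can be sketched by comparing leaf column paths across the two layouts. This is a minimal illustration only: `Node`, `Leaf`, `Group`, `leafPaths`, and `listField` are hypothetical names for a simplified schema tree, not the parquet-mr or Spark classes.

```scala
object ListSchemaSketch {
  // Hypothetical minimal model of a Parquet schema node.
  sealed trait Node { def name: String }
  final case class Leaf(name: String) extends Node
  final case class Group(name: String, children: Seq[Node]) extends Node

  // Collect the dotted path of every leaf column under a node.
  def leafPaths(node: Node, prefix: Seq[String] = Nil): Seq[String] = node match {
    case Leaf(n)        => Seq((prefix :+ n).mkString("."))
    case Group(n, kids) => kids.flatMap(leafPaths(_, prefix :+ n))
  }

  // The 3-level LIST layout from the issue, with a configurable innermost field name.
  def listField(elementName: String): Node =
    Group("f", Seq(Group("bag", Seq(Leaf(elementName)))))

  def main(args: Array[String]): Unit = {
    val spark14 = leafPaths(listField("array"))         // layout written by Spark 1.4
    val spark150 = leafPaths(listField("array_element")) // layout written by Spark 1.5.0
    val fixed = leafPaths(listField("array"))           // layout after renaming back

    println(s"1.4 vs 1.5.0 paths match: ${spark14 == spark150}")
    println(s"1.4 vs fixed paths match: ${spark14 == fixed}")
  }
}
```

Only the innermost name changes between the layouts; the 3-level structure and the repeated group name "bag" are the same in all three.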

            People

            • Assignee:
              Cheng Lian
              Reporter:
              Cheng Lian
            • Votes:
              0
              Watchers:
              4