PARQUET-359: Existing _common_metadata should be deleted when ParquetOutputCommitter fails to write summary files


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.0, 1.7.0, 1.8.0
    • Fix Version/s: None
    • Component/s: parquet-mr
    • Labels: None

    Description

      ParquetOutputCommitter only deletes _metadata when it fails to write the summary files. This can leave a stale, inconsistent _common_metadata file behind.
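
      A minimal sketch of the missing cleanup, written here as standalone Scala against the Hadoop FileSystem API (the helper name cleanUpSummaries is illustrative, not the actual parquet-mr code; the literal file names stand in for the constants parquet-mr defines):

      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.{FileSystem, Path}
      
      // On summary-write failure, remove BOTH summary files so that no stale,
      // inconsistent metadata from a previous successful job survives.
      def cleanUpSummaries(outputPath: Path, conf: Configuration): Unit = {
        val fs: FileSystem = outputPath.getFileSystem(conf)
        // delete() returns false when the file does not exist, which is fine here
        fs.delete(new Path(outputPath, "_metadata"), false)        // already deleted today
        fs.delete(new Path(outputPath, "_common_metadata"), false) // the missing piece
      }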

      This issue can be reproduced via the following Spark shell snippet:

      import sqlContext.implicits._
      
      val path = "file:///tmp/foo"
      
      // 1st write: the nested group `_1` has 2 string fields
      (0 until 3).map(i => Tuple1((s"a_$i", s"b_$i"))).toDF().coalesce(1).write.mode("overwrite").parquet(path)
      // 2nd write: appends a file whose nested group `_1` has 3 string fields
      (0 until 3).map(i => Tuple1((s"a_$i", s"b_$i", s"c_$i"))).toDF().coalesce(1).write.mode("append").parquet(path)
      

      The 2nd write job fails to write the summary files because the two written Parquet files carry different user-defined metadata (different Spark SQL schemas). Afterwards we can see that a stale _common_metadata file is left behind:

      $ tree /tmp/foo
      /tmp/foo
      ├── _SUCCESS
      ├── _common_metadata
      ├── part-r-00000-1c8bcb7f-84cf-43e3-9cd6-04d371322d95.gz.parquet
      └── part-r-00000-d759c53f-d12f-4555-9b27-8b03a8343b17.gz.parquet
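
      For intuition: writing the summary files requires merging the user-defined key-value metadata from every footer, and two distinct values for the same key (here, Spark SQL's schema key) make that merge fail. A toy Scala sketch of that merge rule (the function name and error text are illustrative, not parquet-mr's actual code):

      // Merge per-file key/value metadata; any key bound to two distinct values aborts the merge.
      def mergeKeyValueMetadata(footers: Seq[Map[String, String]]): Map[String, String] =
        footers.foldLeft(Map.empty[String, String]) { (merged, kv) =>
          kv.foldLeft(merged) { case (acc, (key, value)) =>
            acc.get(key) match {
              case Some(existing) if existing != value =>
                throw new RuntimeException(s"could not merge metadata: key $key has conflicting values")
              case _ => acc + (key -> value)
            }
          }
        }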
      

      Checking its schema, the nested group contains only the 2 fields from the 1st write, which is wrong:

      $ parquet-schema /tmp/foo/_common_metadata
      message root {
        optional group _1 {
          optional binary _1 (UTF8);
          optional binary _2 (UTF8);
        }
      }
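
      The stale summary is not harmless: depending on the Spark version and its schema-resolution settings, reading the directory back may trust _common_metadata and silently drop the third field. A spark-shell check along these lines (the mergeSchema option exists in Spark 1.5+; exact behavior varies by version):

      // Continuing the repro above; with schema merging off, Spark may take the
      // schema from the stale _common_metadata and expose only the 2-field struct.
      val df = sqlContext.read.option("mergeSchema", "false").parquet(path)
      df.printSchema()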
      

            People

              Assignee: Cheng Lian
              Reporter: Cheng Lian
              Votes: 0
              Watchers: 3
