PARQUET-359: Existing _common_metadata should be deleted when ParquetOutputCommitter fails to write summary files


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.0, 1.7.0, 1.8.0
    • Fix Version/s: None
    • Component/s: parquet-mr
    • Labels: None

    Description

      ParquetOutputCommitter only deletes _metadata when it fails to write the summary files. This can leave a stale, inconsistent _common_metadata file behind.
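
      A minimal sketch of the missing cleanup, written here as standalone Scala against the Hadoop FileSystem API (the helper name cleanUpSummaries is illustrative, not the actual parquet-mr code; the literal file names stand in for the constants parquet-mr defines):

      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.{FileSystem, Path}
      
      // On summary-write failure, remove BOTH summary files so that no stale,
      // inconsistent metadata from a previous successful job survives.
      def cleanUpSummaries(outputPath: Path, conf: Configuration): Unit = {
        val fs: FileSystem = outputPath.getFileSystem(conf)
        // delete() returns false when the file does not exist, which is fine here
        fs.delete(new Path(outputPath, "_metadata"), false)        // already deleted today
        fs.delete(new Path(outputPath, "_common_metadata"), false) // the missing piece
      }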

      This issue can be reproduced via the following Spark shell snippet:

      import sqlContext.implicits._
      
      val path = "file:///tmp/foo"
      
      // 1st write: the nested group `_1` has 2 string fields
      (0 until 3).map(i => Tuple1((s"a_$i", s"b_$i"))).toDF().coalesce(1).write.mode("overwrite").parquet(path)
      // 2nd write: appends a file whose nested group `_1` has 3 string fields
      (0 until 3).map(i => Tuple1((s"a_$i", s"b_$i", s"c_$i"))).toDF().coalesce(1).write.mode("append").parquet(path)
      

      The 2nd write job fails to write the summary files because the two written Parquet files carry different user-defined metadata (different Spark SQL schemas). Afterwards we can see that a stale _common_metadata file is left behind:

      $ tree /tmp/foo
      /tmp/foo
      ├── _SUCCESS
      ├── _common_metadata
      ├── part-r-00000-1c8bcb7f-84cf-43e3-9cd6-04d371322d95.gz.parquet
      └── part-r-00000-d759c53f-d12f-4555-9b27-8b03a8343b17.gz.parquet
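
      For intuition: writing the summary files requires merging the user-defined key-value metadata from every footer, and two distinct values for the same key (here, Spark SQL's schema key) make that merge fail. A toy Scala sketch of that merge rule (the function name and error text are illustrative, not parquet-mr's actual code):

      // Merge per-file key/value metadata; any key bound to two distinct values aborts the merge.
      def mergeKeyValueMetadata(footers: Seq[Map[String, String]]): Map[String, String] =
        footers.foldLeft(Map.empty[String, String]) { (merged, kv) =>
          kv.foldLeft(merged) { case (acc, (key, value)) =>
            acc.get(key) match {
              case Some(existing) if existing != value =>
                throw new RuntimeException(s"could not merge metadata: key $key has conflicting values")
              case _ => acc + (key -> value)
            }
          }
        }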
      

      Checking its schema, the nested group contains only the 2 fields from the 1st write, which is wrong:

      $ parquet-schema /tmp/foo/_common_metadata
      message root {
        optional group _1 {
          optional binary _1 (UTF8);
          optional binary _2 (UTF8);
        }
      }
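
      The stale summary is not harmless: depending on the Spark version and its schema-resolution settings, reading the directory back may trust _common_metadata and silently drop the third field. A spark-shell check along these lines (the mergeSchema option exists in Spark 1.5+; exact behavior varies by version):

      // Continuing the repro above; with schema merging off, Spark may take the
      // schema from the stale _common_metadata and expose only the 2-field struct.
      val df = sqlContext.read.option("mergeSchema", "false").parquet(path)
      df.printSchema()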
      

            People

              Assignee: Cheng Lian
              Reporter: Cheng Lian
              Votes: 0
              Watchers: 3
