Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.6.0, 1.7.0, 1.8.0
Fix Version/s: None
Component/s: None
Description
ParquetOutputCommitter only deletes _metadata when it fails to write the summary files. This can leave a stale, inconsistent _common_metadata file behind.
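For context, here is a minimal sketch of the failure path described above (the method and helper names are hypothetical, not the actual parquet-mr source):

import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical stand-in for the footer merge, which throws when the part
// files carry conflicting user-defined metadata (as in the repro below):
def mergeAndWriteSummaries(fs: FileSystem, outputPath: Path): Unit =
  throw new RuntimeException("could not merge footers: conflicting metadata")

def writeSummaryFiles(fs: FileSystem, outputPath: Path): Unit = {
  try {
    mergeAndWriteSummaries(fs, outputPath)
  } catch {
    case _: Exception =>
      // Current behaviour: only _metadata is cleaned up on failure...
      fs.delete(new Path(outputPath, "_metadata"), false)
      // ...so a _common_metadata file written by an earlier job survives.
      // A fix should remove it as well:
      fs.delete(new Path(outputPath, "_common_metadata"), false)
  }
}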
This issue can be reproduced via the following Spark shell snippet:
import sqlContext.implicits._

val path = "file:///tmp/foo"

(0 until 3).map(i => Tuple1((s"a_$i", s"b_$i"))).toDF().coalesce(1).write.mode("overwrite").parquet(path)
(0 until 3).map(i => Tuple1((s"a_$i", s"b_$i", s"c_$i"))).toDF().coalesce(1).write.mode("append").parquet(path)
The second write job fails to write the summary files because the two written Parquet files contain different user-defined metadata (different Spark SQL schemas). Listing the output directory shows that a _common_metadata file is left behind:
$ tree /tmp/foo
/tmp/foo
├── _SUCCESS
├── _common_metadata
├── part-r-00000-1c8bcb7f-84cf-43e3-9cd6-04d371322d95.gz.parquet
└── part-r-00000-d759c53f-d12f-4555-9b27-8b03a8343b17.gz.parquet
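To see the conflict directly, each part file can be read on its own in the same shell session (file names taken from the listing above); one of them reports the two-field schema and the other the three-field one, which is why the footers cannot be merged:

// Continuing the shell session above: read each part file individually.
sqlContext.read.parquet(s"$path/part-r-00000-1c8bcb7f-84cf-43e3-9cd6-04d371322d95.gz.parquet").printSchema()
sqlContext.read.parquet(s"$path/part-r-00000-d759c53f-d12f-4555-9b27-8b03a8343b17.gz.parquet").printSchema()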
Checking the schema of the leftover _common_metadata shows that the nested group contains only two fields, which is wrong (the appended rows have three):
$ parquet-schema /tmp/foo/_common_metadata
message root {
  optional group _1 {
    optional binary _1 (UTF8);
    optional binary _2 (UTF8);
  }
}
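As a workaround until the committer is fixed, summary file generation can be disabled entirely, assuming the standard parquet.enable.summary-metadata Hadoop configuration key honored by these parquet-mr versions:

// In the Spark shell, before writing:
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

With summary files disabled, no _metadata or _common_metadata is written at all, so there is nothing to be left in an inconsistent state.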
Issue Links
- depends upon
  - PARQUET-381: It should be possible to merge summary files, and control which files are generated (Resolved)