[SPARK-22217] ParquetFileFormat to support arbitrary OutputCommitters - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.2.0
Fix Version/s: 2.2.1, 2.3.0
Component/s: SQL
Labels: None
Target Version/s: 2.3.0

Description

Although spark.sql.parquet.output.committer.class lets you choose which committer is used to write DataFrames as Parquet data, you get a ClassCastException if the class is not an org.apache.parquet.hadoop.ParquetOutputCommitter or a subclass of it.

This is inconsistent with the docs in SQLConf, which say:

The specified class needs to be a subclass of org.apache.hadoop.mapreduce.OutputCommitter. Typically, it's also a subclass of org.apache.parquet.hadoop.ParquetOutputCommitter.

It is simple to relax ParquetFileFormat's requirements. However, if the user has set parquet.enable.summary-metadata=true together with a committer that is not a ParquetOutputCommitter, they won't see the summary metadata.
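
The relaxed requirement described above can be sketched roughly as follows. This is a minimal Python illustration, not the Scala change from the linked pull request: the classes are hypothetical stand-ins named after the Hadoop and Parquet types mentioned in this issue, and check_committer, CustomCommitter, and the warning text are assumptions made up for the example.

```python
import logging

log = logging.getLogger("parquet_committer_sketch")


# Hypothetical stand-ins for the JVM classes named in the issue.
class OutputCommitter:
    """Stand-in for org.apache.hadoop.mapreduce.OutputCommitter."""


class FileOutputCommitter(OutputCommitter):
    """Stand-in for Hadoop's file-based committer."""


class ParquetOutputCommitter(FileOutputCommitter):
    """Stand-in for org.apache.parquet.hadoop.ParquetOutputCommitter."""


class CustomCommitter(OutputCommitter):
    """Hypothetical committer that is not Parquet-specific."""


def check_committer(committer_class, summary_metadata_enabled):
    """Accept any OutputCommitter; return True if it can write Parquet summaries."""
    # Relaxed requirement: only the general OutputCommitter contract is enforced.
    if not issubclass(committer_class, OutputCommitter):
        raise TypeError(f"{committer_class.__name__} is not an OutputCommitter")

    writes_summaries = issubclass(committer_class, ParquetOutputCommitter)

    # Accept the committer, but warn that summary files will not be produced.
    if summary_metadata_enabled and not writes_summaries:
        log.warning(
            "%s is not a ParquetOutputCommitter; no summary metadata will be written",
            committer_class.__name__,
        )
    return writes_summaries
```

With these stand-ins, check_committer(CustomCommitter, True) returns False and logs a warning instead of failing, while a class outside the OutputCommitter hierarchy is still rejected.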

Issue Links

is related to: HADOOP-13786 Add S3A committers for zero-rename commits to S3 endpoints (Resolved)

links to: [Github] Pull Request #19448 (steveloughran)

People

Assignee: Steve Loughran
Reporter: Steve Loughran
Votes: 0
Watchers: 3

Dates

Created: 06/Oct/17 15:56
Updated: 12/Dec/22 18:11
Resolved: 12/Oct/17 23:41