Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Information Provided
Affects Version/s: 2.4.5
Fix Version/s: None
Component/s: None
Description
My program's logs say it uses ParquetOutputCommitter when I write Parquet, even after I configure it to use PartitionedStagingCommitter with the following settings:
- sparkSession.conf().set("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a", "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory");
- sparkSession.conf().set("fs.s3a.committer.name", "partitioned");
- sparkSession.conf().set("fs.s3a.committer.staging.conflict-mode", "append");
- sparkSession.conf().set("spark.hadoop.parquet.mergeSchema", "false");
- sparkSession.conf().set("spark.hadoop.parquet.enable.summary-metadata", false);
Application log output:
20/03/06 10:15:17 INFO ParquetFileFormat: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm version is 2
20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm version is 2
20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
But when I use ORC as the file format, with the same configuration as above, it correctly picks PartitionedStagingCommitter:
20/03/05 11:51:14 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
20/03/05 11:51:14 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using committer partitioned to output data to s3a:************
20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using Commmitter PartitionedStagingCommitter**********
So I am wondering: why do Parquet and ORC behave differently?
How can I use PartitionedStagingCommitter instead of ParquetOutputCommitter?
I started looking into this because, when saving data directly to S3 with partitionBy() on two columns, I was intermittently getting FileNotFoundException errors.
So how can I avoid this issue when writing Parquet from Spark to S3 over s3a, without S3Guard?
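For readers hitting the same behavior: Spark's ParquetFileFormat requires a committer that subclasses ParquetOutputCommitter, so the S3A committer-factory setting alone is not sufficient for Parquet (ORC has no such type check). The Spark cloud-integration documentation describes two extra settings, backed by the optional spark-hadoop-cloud module, that satisfy the type check while delegating to the S3A committer factory. A minimal sketch in Java, assuming spark-hadoop-cloud is on the classpath (the application name is a placeholder):

import org.apache.spark.sql.SparkSession;

// Sketch only: bind Parquet writes to the S3A committer factory.
// The org.apache.spark.internal.io.cloud classes come from the
// optional spark-hadoop-cloud module.
SparkSession spark = SparkSession.builder()
        .appName("parquet-s3a-committer-demo") // placeholder name
        // Route all s3a:// output through the S3A committer factory.
        .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
                "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
        .config("spark.hadoop.fs.s3a.committer.name", "partitioned")
        .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "append")
        // Parquet-specific bindings: a commit protocol plus a committer
        // that subclasses ParquetOutputCommitter, so ParquetFileFormat no
        // longer falls back to its default committer.
        .config("spark.sql.sources.commitProtocolClass",
                "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
        .config("spark.sql.parquet.output.committer.class",
                "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
        .getOrCreate();

With these two extra settings the logs should report a user-defined committer instead of ParquetOutputCommitter, and AbstractS3ACommitterFactory should select PartitionedStagingCommitter. Because the staging committers upload files as multipart PUTs instead of renaming a _temporary directory into place, this should also address the intermittent FileNotFoundException on raw S3 without S3Guard.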