SPARK-31072

Spark defaults to ParquetOutputCommitter even after configuring the s3a committer as "partitioned"


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Information Provided
    • Affects Version/s: 2.4.5
    • Fix Version/s: None
    • Component/s: Java API
    • Labels: None

    Description

My program's logs say it is using ParquetOutputCommitter when I write Parquet, even after I configure the PartitionedStagingCommitter with the following settings (a consolidated sketch follows the list):

      • sparkSession.conf().set("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a", "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory");
      • sparkSession.conf().set("fs.s3a.committer.name", "partitioned");
      • sparkSession.conf().set("fs.s3a.committer.staging.conflict-mode", "append");
      • sparkSession.conf().set("spark.hadoop.parquet.mergeSchema", "false");
      • sparkSession.conf().set("spark.hadoop.parquet.enable.summary-metadata", false);
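
For reference, here is the same configuration as one self-contained sketch. The app name is a placeholder and the use of the SparkSession builder is my own variation: I am not sure whether conf().set() calls made after the session already exists always reach the Hadoop configuration, so this sets everything up front with the spark.hadoop. prefix.

    import org.apache.spark.sql.SparkSession;

    public class PartitionedCommitterConfig {
        public static void main(String[] args) {
            // Keys carrying the spark.hadoop. prefix are copied into the Hadoop
            // Configuration that the committer factory reads.
            SparkSession sparkSession = SparkSession.builder()
                    .appName("s3a-partitioned-committer")   // placeholder name
                    .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
                            "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
                    .config("spark.hadoop.fs.s3a.committer.name", "partitioned")
                    .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "append")
                    .config("spark.hadoop.parquet.mergeSchema", "false")
                    .config("spark.hadoop.parquet.enable.summary-metadata", "false")
                    .getOrCreate();
        }
    }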

Application log excerpt:

      20/03/06 10:15:17 INFO ParquetFileFormat: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
      20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm version is 2
      20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
      20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
      20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm version is 2
      20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
      20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter

But when I use ORC as the file format with the same configuration as above, it correctly picks the PartitionedStagingCommitter:
      20/03/05 11:51:14 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
      20/03/05 11:51:14 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
      20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using committer partitioned to output data to s3a:************
      20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using Commmitter PartitionedStagingCommitter**********

So I am wondering: why do Parquet and ORC behave differently?
How can I use the PartitionedStagingCommitter instead of ParquetOutputCommitter?
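
From Spark's cloud-integration documentation it looks like Parquet is special-cased: Spark insists that the Parquet committer subclass ParquetOutputCommitter, so the committer factory setting alone is ignored for Parquet writes. The docs describe two extra settings from the spark-hadoop-cloud module that bind Parquet to the factory; a sketch, assuming that module is on the classpath (I believe this binding is available in Spark builds against Hadoop 3.1+):

    // Route commit handling through the Hadoop PathOutputCommitter factory,
    // and give Parquet a ParquetOutputCommitter subclass that delegates to it.
    sparkSession.conf().set("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol");
    sparkSession.conf().set("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter");

With those set, the PartitionedStagingCommitter selected by fs.s3a.committer.name should be used for Parquet writes as well.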

I started looking into this because, when saving data directly to S3 with partitionBy() on two columns, I was intermittently getting FileNotFoundExceptions.
So how can I avoid this issue when writing Parquet from Spark to S3 over s3a, without S3Guard?
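
For context, the failing write looks roughly like this (bucket, paths, and partition column names are placeholders, not the real ones from my job):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;

    // df stands in for the real Dataset<Row> being written.
    Dataset<Row> df = sparkSession.read().parquet("s3a://my-bucket/input/");
    df.write()
      .mode(SaveMode.Append)
      .partitionBy("date", "region")   // two partition columns, as described above
      .parquet("s3a://my-bucket/output/");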


People

    Assignee: Unassigned
    Reporter: Felix Kizhakkel Jose
