Spark / SPARK-30474

Writing data to Parquet with dynamic partitionOverwriteMode should not do the folder rename in the commitJob stage


    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: Input/Output
    • Labels: None

      Description

      In the current Spark implementation, if you set

      spark.sql.sources.partitionOverwriteMode=dynamic
      

      even with 

      mapreduce.fileoutputcommitter.algorithm.version=2
      

      it still renames the partition folders sequentially in the commitJob stage, as shown here:

      https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L188

      https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L184

       

      Sequential renames are very slow on cloud object stores, where a rename is typically implemented as a copy followed by a delete rather than a cheap metadata operation. We should instead commit the data directly to the final partition locations, similar to FileOutputCommitter algorithm version 2.
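
      The scenario above can be sketched as follows. This is a minimal illustration, not part of the issue: the application name, output path, and sample data are hypothetical, and it assumes Spark is on the classpath. Note that the Hadoop committer setting needs the `spark.hadoop.` prefix when passed through Spark configuration.

      ```scala
      // Minimal sketch of the configuration described in this issue.
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder()
        .master("local[*]")
        .appName("dynamic-overwrite-sketch") // hypothetical app name
        // Hadoop settings take the spark.hadoop. prefix when set via Spark
        .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
        .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
        .getOrCreate()

      // Overwrite only the partitions present in the new data; even with
      // committer algorithm v2, the commitJob stage still renames each
      // touched partition folder one by one into the final output path.
      spark.range(100)
        .selectExpr("id", "id % 10 AS part") // hypothetical sample data
        .write
        .mode("overwrite")
        .partitionBy("part")
        .parquet("/tmp/spark-30474-demo") // hypothetical output path
      ```

      On a local filesystem the per-partition renames are cheap metadata operations, so the cost only becomes visible against an object store such as S3, where each rename is a copy plus a delete of every file in the partition.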

       


              People

              • Assignee: Unassigned
              • Reporter: Zaisheng Dai
              • Votes: 1
              • Watchers: 3
