Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Incomplete
- Affects Version/s: 2.4.3
- Fix Version/s: None
Description
It is desirable to run concurrent jobs that write to different partitions within the same baseDir, using partitionBy and dynamic partitionOverwriteMode; a sketch of the intended usage follows the links below.
See, for example:
https://stackoverflow.com/questions/38964736/multiple-spark-jobs-appending-parquet-data-to-same-base-path-with-partitioning
or the discussion here:
https://github.com/delta-io/delta/issues/9
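A minimal sketch of the intended usage, with placeholder path, app name, and data (the /tmp/baseDir location and the date column are illustrative only); each concurrent job would run as its own Spark application:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("job-a").getOrCreate()

    // Replace only the partitions present in the incoming DataFrame.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    // Job A writes date=2019-01-01; a concurrent Job B would write, say,
    // date=2019-01-02 to the same baseDir, so the two jobs touch disjoint
    // partitions.
    val dfA = spark.range(100).selectExpr("id", "'2019-01-01' AS date")
    dfA.write
      .mode("overwrite")
      .partitionBy("date")
      .parquet("/tmp/baseDir") // placeholder baseDir shared by both jobs

Today this is unsafe: both jobs stage output under the same baseDir, so their job-level committer state collides.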
This doesn't seem that difficult. I suspect the only changes needed are in org.apache.spark.internal.io.HadoopMapReduceCommitProtocol, which already has a flag for dynamicPartitionOverwrite. I got a quick test to work by disabling all committer activity (committer.setupJob, committer.commitJob, etc.) when dynamicPartitionOverwrite is true.
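For illustration, a minimal sketch of one way to wire up such an experiment. The class name is hypothetical, and this variant swaps in a no-op Hadoop OutputCommitter via the protected setupCommitter hook rather than patching HadoopMapReduceCommitProtocol itself as described above; the protocol's own staging-directory renames for dynamic overwrite still run in commitJob:

    import org.apache.hadoop.mapreduce.{JobContext, OutputCommitter, TaskAttemptContext}
    import org.apache.spark.internal.io.HadoopMapReduceCommitProtocol

    // Hypothetical subclass (name invented for this sketch): when
    // dynamicPartitionOverwrite is true, hand Spark a no-op OutputCommitter so
    // no shared job-level state (e.g. the _temporary directory) is created
    // under baseDir. Task output already goes to the protocol's own staging
    // directory in this mode, and commitJob's per-partition renames still
    // replace only the partitions the job actually wrote.
    class NoOpWhenDynamicCommitProtocol(
        jobId: String,
        path: String,
        dynamicPartitionOverwrite: Boolean = false)
      extends HadoopMapReduceCommitProtocol(jobId, path, dynamicPartitionOverwrite) {

      override protected def setupCommitter(context: TaskAttemptContext): OutputCommitter = {
        val delegate = super.setupCommitter(context)
        if (!dynamicPartitionOverwrite) delegate
        else new OutputCommitter {
          override def setupJob(jobContext: JobContext): Unit = ()
          override def commitJob(jobContext: JobContext): Unit = ()
          override def setupTask(taskContext: TaskAttemptContext): Unit = ()
          override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
          override def commitTask(taskContext: TaskAttemptContext): Unit = ()
          override def abortTask(taskContext: TaskAttemptContext): Unit = ()
        }
      }
    }

For a test run, such a class (fully qualified, on the driver and executor classpath) could be swapped in via the internal spark.sql.sources.commitProtocolClass setting; note that datasource writes normally use SQLHadoopMapReduceCommitProtocol, so this is a sketch for experimentation, not a drop-in fix.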
Issue Links
- relates to: SPARK-20236 Overwrite a partitioned data source table should only overwrite related partitions (Resolved)

See also:
https://mail-archives.apache.org/mod_mbox/spark-dev/201909.mbox/%3CCANx3uAinvf2LdtKfWUsykCJ%2BkHh6oYy0Pt_5LvcTSURGmQKQwg%40mail.gmail.com%3E