Spark / SPARK-28945

Allow concurrent writes to different partitions with dynamic partition overwrite

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.4.3
    • Fix Version/s: None
    • Component/s: SQL

    Description

      It is desirable to run concurrent jobs that write to different partitions within the same baseDir, using partitionBy and dynamic partitionOverwriteMode.

      See for example here:
      https://stackoverflow.com/questions/38964736/multiple-spark-jobs-appending-parquet-data-to-same-base-path-with-partitioning

      Or the discussion here:
      https://github.com/delta-io/delta/issues/9

      This doesn't seem that difficult. I suspect the only changes needed are in org.apache.spark.internal.io.HadoopMapReduceCommitProtocol, which already has a flag for dynamicPartitionOverwrite. I got a quick test to work by disabling all committer activity (committer.setupJob, committer.commitJob, etc.) when dynamicPartitionOverwrite is true.
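      The semantics being requested can be illustrated with a small self-contained sketch (plain Python, no Spark; the function name and layout are illustrative, not Spark APIs): dynamic partition overwrite replaces only the partitions present in the new data and leaves the other partitions under baseDir untouched, which is why two jobs touching disjoint partitions should in principle be able to run concurrently.

```python
import os
import shutil
import tempfile

def dynamic_partition_overwrite(base_dir, new_data):
    """Overwrite only the partitions present in new_data.

    new_data maps a partition value (e.g. "date=2019-09-01") to a list of
    (filename, contents) pairs. Partitions not mentioned are left untouched,
    which is what distinguishes dynamic overwrite from static overwrite
    (which would wipe base_dir first).
    """
    for partition, files in new_data.items():
        part_dir = os.path.join(base_dir, partition)
        if os.path.isdir(part_dir):
            shutil.rmtree(part_dir)  # replace this partition only
        os.makedirs(part_dir)
        for name, contents in files:
            with open(os.path.join(part_dir, name), "w") as f:
                f.write(contents)

base = tempfile.mkdtemp()
# An existing table with two partitions.
dynamic_partition_overwrite(base, {
    "date=2019-09-01": [("part-0", "old a")],
    "date=2019-09-02": [("part-0", "old b")],
})
# A job writing only date=2019-09-02 replaces that partition alone.
dynamic_partition_overwrite(base, {"date=2019-09-02": [("part-0", "new b")]})
print(sorted(os.listdir(base)))  # ['date=2019-09-01', 'date=2019-09-02']
```

      In Spark itself this corresponds to setting spark.sql.sources.partitionOverwriteMode=dynamic and writing with mode("overwrite"); the open question in this ticket is whether the commit protocol is safe when two such jobs run at once.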

    Attachments

    Issue Links

    Activity

            koert koert kuipers added a comment - See also: https://mail-archives.apache.org/mod_mbox/spark-dev/201909.mbox/%3CCANx3uAinvf2LdtKfWUsykCJ%2BkHh6oYy0Pt_5LvcTSURGmQKQwg%40mail.gmail.com%3E
            cloud_fan Wenchen Fan added a comment -

            cc advancedxy do you want to work on it?

            advancedxy YE added a comment -

            Yeah, I can work on this. 

            stevel@apache.org Steve Loughran added a comment -

            It's a core part of the Hadoop MR commit protocols. I think the best (only!) docs of these, other than the most confusing piece of co-recursive code I've ever had to step through while taking notes, are: https://github.com/steveloughran/zero-rename-committer/releases/tag/tag_draft_005

            Every MR app attempt has its own attempt ID; when the Hadoop MR engine restarts attempt N, it looks for the temp dir of attempt N-1 and can use it to recover from failure. Spark's solution to the app-restart problem is "be faster and fix failures by restarting entirely", so the app attempt is always 0.

            If you have two jobs writing to the same destination path, their output is inevitably going to conflict: the first job commit will delete the attempt dir, so the second will fail.

            1. You need to (somehow) get a different attempt ID for each job to avoid that clash.
            2. Jobs need to set "mapreduce.fileoutputcommitter.cleanup.skipped" to true to avoid a full cleanup of _temporary on job commit. That carries a risk of leaking temp files after job failures.
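            The clash described above can be reproduced with a toy model (plain Python; the function names are illustrative, not the actual Hadoop committer code): both jobs stage output under the same `_temporary/0` attempt directory because the app attempt is always 0, and the first job's commit removes the whole `_temporary` tree, so the second job's commit finds its pending files gone.

```python
import os
import shutil
import tempfile
from pathlib import Path

ATTEMPT_ID = 0  # Spark always runs as app attempt 0

def setup_job(dest, job):
    # Both jobs stage under the same _temporary/<attempt> tree,
    # because the attempt ID never varies.
    staging = os.path.join(dest, "_temporary", str(ATTEMPT_ID), job)
    os.makedirs(staging, exist_ok=True)
    return staging

def commit_job(dest, staging, filename):
    src = os.path.join(staging, filename)
    if not os.path.exists(src):
        raise FileNotFoundError("staged output missing: " + src)
    shutil.move(src, os.path.join(dest, filename))
    # Cleanup deletes the whole _temporary tree, taking every other
    # in-flight job's staged files with it.
    shutil.rmtree(os.path.join(dest, "_temporary"))

dest = tempfile.mkdtemp()
s1 = setup_job(dest, "job1")
s2 = setup_job(dest, "job2")
Path(s1, "part-a").write_text("a")
Path(s2, "part-b").write_text("b")

commit_job(dest, s1, "part-a")  # job 1 commits and wipes _temporary
second_failed = False
try:
    commit_job(dest, s2, "part-b")  # job 2's staged file is already gone
except FileNotFoundError:
    second_failed = True
print("second commit failed:", second_failed)  # second commit failed: True
```

            A per-job attempt ID (point 1 above) would give each job its own `_temporary/<attempt>` subtree, so one job's commit and cleanup could no longer destroy the other's staged files.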
            hzfeiwang feiwang added a comment (edited)

            cloud_fan advancedxy
            Hi, I think the exception shown in the email (https://mail-archives.apache.org/mod_mbox/spark-dev/201909.mbox/%3CCANx3uAinvf2LdtKfWUsykCJ%2BkHh6oYy0Pt_5LvcTSURGmQKQwg%40mail.gmail.com%3E) is related to this PR (https://github.com/apache/spark/pull/25795).

            When dynamicPartitionOverwrite is true, we should skip commitJob.

            advancedxy YE added a comment -

            > Hi, I think the exception shown in the email (https://mail-archives.apache.org/mod_mbox/spark-dev/201909.mbox/%3CCANx3uAinvf2LdtKfWUsykCJ%2BkHh6oYy0Pt_5LvcTSURGmQKQwg%40mail.gmail.com%3E) is related to this PR (https://github.com/apache/spark/pull/25795).

            > When dynamicPartitionOverwrite is true, we should skip commitJob.


            Well, it's not that simple. If you are going to skip commitJob for dynamicPartitionOverwrite, you should skip setupJob and abortJob for consistency.

            As we discussed offline, I believe the fix in my PR https://github.com/apache/spark/pull/25739 should cover concurrent writes with dynamicPartitionOverwrite. Adding an output-existence check and setting a specific output path for strict dynamic partition overwrite should cover your cases too.

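            The consistency point can be sketched as follows (plain Python; the class and method names are hypothetical, modeled loosely on HadoopMapReduceCommitProtocol rather than taken from Spark): if commitJob is skipped when dynamicPartitionOverwrite is set, then setupJob and abortJob must be guarded by the same flag, otherwise a job could set up committer state that is never committed or torn down.

```python
class CommitProtocolSketch:
    """Toy model: every committer lifecycle hook honors the same flag."""

    def __init__(self, dynamic_partition_overwrite):
        self.dynamic_partition_overwrite = dynamic_partition_overwrite
        self.calls = []  # record which committer hooks actually ran

    def _run_committer(self, hook):
        self.calls.append(hook)

    def setup_job(self):
        if not self.dynamic_partition_overwrite:
            self._run_committer("setupJob")

    def commit_job(self):
        if not self.dynamic_partition_overwrite:
            self._run_committer("commitJob")

    def abort_job(self):
        if not self.dynamic_partition_overwrite:
            self._run_committer("abortJob")

dynamic = CommitProtocolSketch(dynamic_partition_overwrite=True)
dynamic.setup_job(); dynamic.commit_job(); dynamic.abort_job()
print(dynamic.calls)  # [] -- all hooks skipped together, no dangling state

static = CommitProtocolSketch(dynamic_partition_overwrite=False)
static.setup_job(); static.commit_job()
print(static.calls)  # ['setupJob', 'commitJob']
```

            Skipping only commitJob, by contrast, would leave the setupJob state from the committer with no matching commit or abort, which is exactly the inconsistency the comment warns about.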
            hzfeiwang feiwang added a comment (edited)

            advancedxy
            Thanks. Hope SPARK-28945 can be merged soon.
            It is important for data quality.

            koert koert kuipers added a comment -

            I understand there is a great deal of complexity in the committer and this might require more work to get right.

            But it's still unclear to me whether the committer is doing anything at all in the case of dynamic partition overwrite.
            What do I lose by disabling all committer activity (committer.setupJob, committer.commitJob, etc.) when dynamicPartitionOverwrite is true? And if I lose nothing, is that a good thing, or does it mean I should be worried about the current state?


            People

              Assignee: Unassigned
              Reporter: koert kuipers
              Votes: 5
              Watchers: 9
