[HADOOP-18568] Magic Committer optional clean up - ASF JIRA

Details

Type: Wish
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 3.3.3
Fix Version/s: None
Component/s: fs/s3
Labels: None

Description

It seems that deleting the `__magic` folder can take a very long time, depending on the number of tasks/partitions in a given Spark job. I'm seeing the following behavior on a Spark job (processing ~30TB, with ~420k tasks) using the magic committer:

2022-12-10T21:25:19.629Z pool-3-thread-32 INFO MagicS3GuardCommitter: Starting: Deleting magic directory s3a://my-bucket/random_hash/__magic
2022-12-10T21:52:03.250Z pool-3-thread-32 INFO MagicS3GuardCommitter: Deleting magic directory s3a://my-bucket/random_hash/__magic: duration 26:43.620s

I don't see a way around this, since deleting S3 objects requires listing all objects under a prefix, and that listing may be what takes so long. Could this cleanup be made optional? The idea would be to delegate it to S3 lifecycle policies so that it adds no overhead to the commit phase.
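
To illustrate the lifecycle-policy idea, here is a minimal sketch of a rule that expires everything under a job's `__magic/` prefix. The helper name `magic_cleanup_lifecycle` and the one-day default are assumptions made for illustration; they are not part of Hadoop or the committer. It only builds the configuration document. Applying it to a bucket (for example through the S3 `PutBucketLifecycleConfiguration` API) is left out. Note that a lifecycle filter matches a literal key prefix, so the job's output path has to be known when the rule is created.

```python
import json


def magic_cleanup_lifecycle(output_prefix: str, days: int = 1) -> dict:
    """Build an S3 lifecycle configuration that expires objects under
    <output_prefix>/__magic/ after `days` days.

    Hypothetical helper for illustration only; not a Hadoop API.
    """
    if days < 1:
        raise ValueError("days must be >= 1")

    # Normalise the job output path: no leading or trailing slashes.
    prefix = output_prefix.strip("/")
    magic_prefix = f"{prefix}/__magic/" if prefix else "__magic/"

    # Rule IDs are free-form labels; slashes are replaced for readability.
    rule_id = "expire-magic-" + (prefix.replace("/", "-") or "root")

    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": magic_prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


if __name__ == "__main__":
    # Path taken from the log lines above.
    print(json.dumps(magic_cleanup_lifecycle("random_hash"), indent=2))
```

Lifecycle expiration runs asynchronously on the S3 side, so the objects would linger for at least the configured number of days. That is the trade-off this request accepts in exchange for a shorter commit phase.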

Issue Links

relates to: HADOOP-17833 Improve Magic Committer Performance (Resolved)

People

Assignee: Unassigned

Reporter: André F.

Votes: 1

Watchers: 4

Dates

Created: 12/Dec/22 08:53

Updated: 18/Sep/23 04:59