Hadoop Common / HADOOP-18568

Magic Committer optional cleanup


Details

    • Type: Wish
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.3.3
    • Fix Version/s: None
    • Component/s: fs/s3
    • Labels: None

Description

It seems that deleting the `__magic` folder can take a really long time, depending on the number of tasks/partitions used by a given Spark job. I'm seeing the following behaviour on a Spark job (processing ~30 TB, with ~420k tasks) using the magic committer:

      2022-12-10T21:25:19.629Z pool-3-thread-32 INFO MagicS3GuardCommitter: Starting: Deleting magic directory s3a://my-bucket/random_hash/__magic
      2022-12-10T21:52:03.250Z pool-3-thread-32 INFO MagicS3GuardCommitter: Deleting magic directory s3a://my-bucket/random_hash/__magic: duration 26:43.620s 

I don't see a way around this, since deleting S3 objects requires first listing every object under the prefix, and that listing is probably what takes so long. Could we somehow make this cleanup optional? The idea would be to delegate it to S3 lifecycle policies, so the commit phase doesn't carry this overhead. Sketches of both halves of the idea follow.
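For illustration, a minimal sketch of what the opt-out could look like. The property name `fs.s3a.committer.magic.cleanup.enabled` is invented here purely for illustration (no such key exists in 3.3.3); it would guard the recursive delete that is the slow step:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class OptionalMagicCleanup {

        // Hypothetical property, named here only for illustration;
        // it does not exist in Hadoop 3.3.3.
        static final String MAGIC_CLEANUP_ENABLED =
            "fs.s3a.committer.magic.cleanup.enabled";

        /** Delete the __magic directory only when cleanup is enabled. */
        static void cleanupMagicDir(Configuration conf, Path magicPath)
                throws IOException {
            if (conf.getBoolean(MAGIC_CLEANUP_ENABLED, true)) {
                // Current behaviour: recursive delete, which must list every
                // object under the prefix and is what takes ~26 minutes above.
                FileSystem fs = magicPath.getFileSystem(conf);
                fs.delete(magicPath, true);
            }
            // When disabled, the __magic tree is left in place for a bucket
            // lifecycle rule to expire asynchronously.
        }
    }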
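And a sketch of the lifecycle-policy side, using the AWS SDK for Java v2 to expire everything under the job's `__magic` prefix after a day. The bucket name and prefix are the placeholders from the log above, and lifecycle filters match literal key prefixes only (no wildcards), so the rule has to be anchored at a known output path:

    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.BucketLifecycleConfiguration;
    import software.amazon.awssdk.services.s3.model.ExpirationStatus;
    import software.amazon.awssdk.services.s3.model.LifecycleExpiration;
    import software.amazon.awssdk.services.s3.model.LifecycleRule;
    import software.amazon.awssdk.services.s3.model.LifecycleRuleFilter;
    import software.amazon.awssdk.services.s3.model.PutBucketLifecycleConfigurationRequest;

    public class MagicLifecycleRule {
        public static void main(String[] args) {
            try (S3Client s3 = S3Client.create()) {
                LifecycleRule rule = LifecycleRule.builder()
                    .id("expire-magic-committer-files")
                    // Prefix from the log above; lifecycle filters only match
                    // literal key prefixes, not patterns like */__magic/*.
                    .filter(LifecycleRuleFilter.builder()
                        .prefix("random_hash/__magic/")
                        .build())
                    // Expire leftover committer files a day after creation.
                    .expiration(LifecycleExpiration.builder().days(1).build())
                    .status(ExpirationStatus.ENABLED)
                    .build();

                s3.putBucketLifecycleConfiguration(
                    PutBucketLifecycleConfigurationRequest.builder()
                        .bucket("my-bucket")
                        .lifecycleConfiguration(
                            BucketLifecycleConfiguration.builder()
                                .rules(rule)
                                .build())
                        .build());
            }
        }
    }

One caveat with this approach: the magic committer's in-flight multipart uploads must survive until job commit, so if an `AbortIncompleteMultipartUpload` action were added to such a rule, its window would need to be longer than the longest expected job.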


People

    Assignee: Unassigned
    Reporter: André F. (andre.amorimfonseca@gmail.com)
    Votes: 1
    Watchers: 4
