Hadoop Common / HADOOP-18797

Support Concurrent Writes With S3A Magic Committer



    Description

      There is a failure in the commit process when multiple jobs are writing concurrently to the same S3 directory using magic committers.

      This issue is closely related to HADOOP-17318.

      When multiple Spark jobs write to the same S3A directory, they upload files simultaneously using "__magic" as the base directory for staging. Inside this directory, there are multiple "/job-some-uuid" directories, each representing a concurrently running job.
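
      For illustration, the layout looks roughly like this (bucket, table and job UUIDs are hypothetical):

      s3a://my-bucket/my-table/__magic/
          job-1111-aaaa.../   <- staged uploads and metadata for Job_1
          job-2222-bbbb.../   <- staged uploads and metadata for Job_2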

      To fix some problems related to concurrency, a property was introduced in that previous fix: "spark.hadoop.fs.s3a.committer.abort.pending.uploads". When set to false, it ensures that during the cleanup stage a finalizing job does not abort pending uploads belonging to other jobs. So we see this line in the logs: 

      DEBUG [main] o.a.h.fs.s3a.commit.AbstractS3ACommitter (819): Not cleanup up pending uploads to s3a ...

      (from AbstractS3ACommitter.java#L952)
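
      The property is normally set on the job's Hadoop configuration (Spark forwards anything under the "spark.hadoop." prefix into it). A minimal, self-contained Java sketch of doing that (the class name and printout are purely illustrative):

      import org.apache.hadoop.conf.Configuration;

      public class CommitterCleanupConfig {
        public static void main(String[] args) {
          // Tell the S3A committers not to abort other jobs' pending multipart
          // uploads when a finishing job runs its cleanup.
          Configuration conf = new Configuration();
          conf.setBoolean("fs.s3a.committer.abort.pending.uploads", false);

          // In Spark the same value is forwarded with the "spark.hadoop." prefix, e.g.
          //   --conf spark.hadoop.fs.s3a.committer.abort.pending.uploads=false
          System.out.println(conf.get("fs.s3a.committer.abort.pending.uploads"));
        }
      }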

      However, in the next step, the "__magic" directory is recursively deleted:

      INFO  [main] o.a.h.fs.s3a.commit.magic.MagicS3GuardCommitter (98): Deleting magic directory s3a://my-bucket/my-table/__magic: duration 0:00.560s 

      (from AbstractS3ACommitter.java#L1112 and MagicS3GuardCommitter.java#L137)
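
      In effect, that cleanup step amounts to a recursive FileSystem delete of the shared "__magic" prefix. A minimal standalone sketch of what gets removed (paths and setup are illustrative, not taken from the committer code):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class MagicCleanupSketch {
        public static void main(String[] args) throws Exception {
          // The staging prefix shared by every job writing to this table.
          Path magic = new Path("s3a://my-bucket/my-table/__magic");
          FileSystem fs = FileSystem.get(magic.toUri(), new Configuration());
          // recursive = true: removes every job's subdirectory under __magic,
          // including the staged metadata of jobs that are still running.
          fs.delete(magic, true);
        }
      }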

      This deletion affects the second job, which is still running, because it loses its pending uploads (i.e., the ".pendingset" and ".pending" files).

      The consequences range from an exception in the best case to silent data loss in the worst case. The latter occurs when Job_1 deletes the files just before Job_2 executes "listPendingUploadsToCommit" to list the ".pendingset" files in its job attempt directory, prior to completing the uploads with POST requests.

      To resolve this issue, it's important to ensure that only the prefix associated with the job that is currently finalizing is cleaned up.

      Here's a possible solution:

      /**
       * Delete the magic directory.
       *
       * Proposed change: delete only this job's subdirectory under "__magic"
       * instead of the shared root, so concurrent jobs keep their pending uploads.
       */
      public void cleanupStagingDirs() {
        final Path out = getOutputPath();
        // Previous behaviour removed the whole shared tree:
        //   Path path = magicSubdir(out);
        // Restrict the delete to the directory of the job being finalized:
        Path path = new Path(magicSubdir(out), formatJobDir(getUUID()));

        try (DurationInfo ignored = new DurationInfo(LOG, true,
            "Deleting magic directory %s", path)) {
          Invoker.ignoreIOExceptions(LOG, "cleanup magic directory", path.toString(),
              () -> deleteWithWarning(getDestFS(), path, true));
        }
      }

       

      The side effect of this change is that the "__magic" directory itself is never cleaned up. However, I believe this is a minor concern, especially considering that other artifacts such as "_SUCCESS" also persist after jobs end.

          People

            srahman Syed Shameerur Rahman
            emanuelvelzi Emanuel Velzi