  1. Hadoop Common
  2. HADOOP-14531 [Umbrella] Improve S3A error handling & reporting
  3. HADOOP-14303

Review retry logic on all S3 SDK calls, implement where needed

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.8.0
    • Fix Version/s: None
    • Component/s: fs/s3
    • Labels: None

    Description

      AWS S3, IAM, KMS, DDB, etc. all throttle callers. The S3A code needs to handle this without failing: if it slows down its requests, it can recover.

      1. Look at all the places where we call S3 via the AWS SDK and make sure we retry with a backoff & jitter policy, ideally a unified one. This must be more systematic than the case-by-case, problem-by-problem strategy we are implicitly using.
      2. Many of the AWS S3 SDK calls do implement retry (e.g. PUT/multipart PUT), but we need to check the other parts of the process: login, initiate/complete MPU, ...
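
To make the "backoff & jitter" idea in the list above concrete, here is a minimal sketch of capped exponential backoff with full jitter. The class `BackoffRetry` and its parameters are hypothetical names chosen for illustration and are not part of S3A or the AWS SDK. The sketch retries on any exception; a real policy would presumably retry only failures known to be recoverable.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

/**
 * Hypothetical helper (not S3A code): retries an operation using
 * capped exponential backoff with full jitter.
 */
final class BackoffRetry {
  private final int maxAttempts;
  private final long baseDelayMs;
  private final long maxDelayMs;

  BackoffRetry(int maxAttempts, long baseDelayMs, long maxDelayMs) {
    if (maxAttempts < 1 || baseDelayMs < 0 || maxDelayMs < 0) {
      throw new IllegalArgumentException("invalid retry settings");
    }
    this.maxAttempts = maxAttempts;
    this.baseDelayMs = baseDelayMs;
    this.maxDelayMs = maxDelayMs;
  }

  /** Upper bound of the sleep before retry number 'retry' (0-based). */
  long delayCeilingMs(int retry) {
    // Cap the shift so very high retry counts cannot overflow for
    // moderate base delays; then cap the result at maxDelayMs.
    long exponential = baseDelayMs << Math.min(retry, 20);
    return Math.min(maxDelayMs, exponential);
  }

  /** Invokes the operation, sleeping a random time between failed attempts. */
  <T> T call(Callable<T> operation) throws Exception {
    Exception last = null;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        return operation.call();
      } catch (Exception e) {
        last = e;
        if (attempt == maxAttempts - 1) {
          break; // out of attempts: rethrow below
        }
        // Full jitter: sleep a uniform random time in [0, ceiling].
        long sleepMs =
            ThreadLocalRandom.current().nextLong(delayCeilingMs(attempt) + 1);
        Thread.sleep(sleepMs);
      }
    }
    throw last;
  }
}
```

A caller would wrap each SDK invocation, for example `new BackoffRetry(5, 100, 5000).call(() -> someSdkCall())`, where `someSdkCall` stands in for whatever operation is being protected. Keeping the policy in one object is one way to get the unified behaviour the first item asks for.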

      Related

      HADOOP-13811 Failed to sanitize XML document destined for handler class
      HADOOP-13664 S3AInputStream to use a retry policy on read failures

      This stuff is all hard to test. A key need is to be able to differentiate recoverable throttle & network failures from unrecoverable problems such as auth, network config (e.g. bad endpoint), etc.
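
One possible shape for that differentiation is a small classifier that maps an exception to a retry decision. This is only a sketch under stated assumptions: `FailureClassifier`, `Kind` and `ServiceFailure` are hypothetical names, and `ServiceFailure` is a stand-in for an SDK service exception that carries an HTTP status code. The mapping is a guess at a reasonable starting point: 503 is what S3 returns when it asks callers to slow down, 429 is included as a generic "too many requests" status, and an unknown host is treated as a configuration problem.

```java
import java.io.IOException;
import java.net.UnknownHostException;

/** Hypothetical classifier (not S3A code) separating retryable from fatal failures. */
final class FailureClassifier {
  enum Kind { THROTTLE, TRANSIENT, UNRECOVERABLE }

  /** Stand-in for an SDK service exception that exposes the HTTP status. */
  static final class ServiceFailure extends IOException {
    final int statusCode;

    ServiceFailure(int statusCode, String message) {
      super(message);
      this.statusCode = statusCode;
    }
  }

  static Kind classify(Exception e) {
    if (e instanceof ServiceFailure) {
      int status = ((ServiceFailure) e).statusCode;
      if (status == 503 || status == 429) {
        return Kind.THROTTLE;      // service asked us to slow down
      }
      if (status >= 500) {
        return Kind.TRANSIENT;     // server-side error, worth retrying
      }
      return Kind.UNRECOVERABLE;   // other statuses, e.g. 403 auth failure
    }
    if (e instanceof UnknownHostException) {
      return Kind.UNRECOVERABLE;   // assumed to mean a bad endpoint setting
    }
    if (e instanceof IOException) {
      return Kind.TRANSIENT;       // timeouts, connection resets, etc.
    }
    return Kind.UNRECOVERABLE;     // anything else: do not retry blindly
  }
}
```

Separating `THROTTLE` from `TRANSIENT` lets a caller back off more aggressively when the service is explicitly pushing back, while still retrying ordinary network failures.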

      This may be an opportunity to add a fault-injecting subclass of the Amazon S3 client which can be configured in integration tests to fail at specific points. Ryan Blue's mock S3 client does this in HADOOP-13786, but it is a 100% mock. I'm thinking of something with similar fault raising, but in front of the real S3A client.
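
The fault-raising wrapper described above could look roughly like the following. To stay self-contained, the sketch uses a hypothetical one-method interface `ObjectStore` rather than the real Amazon S3 client type, and `FaultInjectingStore` is likewise an invented name. It fails a configured number of initial calls and then delegates to the wrapped client.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical, simplified stand-in for the client interface under test. */
interface ObjectStore {
  String getObject(String key) throws IOException;
}

/**
 * Hypothetical wrapper (not S3A code): sits in front of a real client
 * and raises faults on the first N calls before delegating.
 */
final class FaultInjectingStore implements ObjectStore {
  private final ObjectStore delegate;
  private final int failuresToInject;
  private final AtomicInteger calls = new AtomicInteger();

  FaultInjectingStore(ObjectStore delegate, int failuresToInject) {
    this.delegate = delegate;
    this.failuresToInject = failuresToInject;
  }

  @Override
  public String getObject(String key) throws IOException {
    int callNumber = calls.incrementAndGet();
    if (callNumber <= failuresToInject) {
      // Simulate a recoverable failure without touching the real service.
      throw new IOException("injected fault on call " + callNumber);
    }
    return delegate.getObject(key);
  }

  /** Total number of calls seen, including the ones that were failed. */
  int callCount() {
    return calls.get();
  }
}
```

In an integration test, the wrapper would be configured with the number of faults to raise, and the test would assert that the calling code recovers once the faults stop. A fuller version might select faults by operation name or exception type rather than by call count.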

            People

              stevel@apache.org Steve Loughran