  1. Hadoop Common
  2. HADOOP-16260

Allow DistCp to create a new tempTarget file per file

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 2.9.2
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      We use distcp to copy entire HDFS clusters to GCS.
      In the process, we hit the following error:

      INFO: Encountered status code 410 when accessing URL https://www.googleapis.com/upload/storage/v1/b/app/o?ifGenerationMatch=0&name=analytics/.distcp.tmp.attempt_local1083459072_0001_m_000000_0&uploadType=resumable&upload_id=AEnB2Uq4mZeZxXgs2Mhx0uskNpZ4Cka8pT4aCcd7v6UC4TDQx-h0uEFWoPpdOO4pWEdmaKnhTjxVva5Ow4vXbTe6_JScIU5fsQSaIwNkF3D84DHjtuhKSCU. Delegating to response handler for possible retry.
      Apr 14, 2019 5:53:17 AM com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation call
      SEVERE: Exception not convertible into handled response
      com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.json.GoogleJsonResponseException: 410 Gone
      {
        "code" : 429,
        "errors" : [ {
          "domain" : "usageLimits",
          "message" : "The total number of changes to the object app/folder/.distcp.tmp.attempt_local1083459072_0001_m_000000_0 exceeds the rate limit. Please reduce the rate of create, update, and delete requests.",
          "reason" : "rateLimitExceeded"
        } ],
        "message" : "The total number of changes to the object app/folder/.distcp.tmp.attempt_local1083459072_0001_m_000000_0 exceeds the rate limit. Please reduce the rate of create, update, and delete requests."
      }
             at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:150)
              at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
              at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
              at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
              at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
              at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
              at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.call(AbstractGoogleAsyncWriteChannel.java:301)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
       
      

      Looking at the code, a DistCp mapper gets a list of files to copy from the source to the target filesystem. The mapper handles each file in its list sequentially: it first creates or overwrites a temp file (.distcp.tmp.attempt_local1083459072_0001_m_000000_0), then copies the source file to the temp file, and finally renames the temp file to the actual target file.
      The temp file name, which contains the task attempt ID, is reused for every file in the mapper's batch. GCS appears to enforce a rate limit on the number of operations per second on any single object. Even though each copy creates a new file and renames it to the final target, GCS treats these as repeated changes to the same object.
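
      The sequence described above can be modelled with a small sketch. This is not the DistCp source; it is a plain-Java illustration that counts how many mutating operations land on each object name when one temp name is shared across a mapper's batch. The class and method names are hypothetical, and the model simply counts one create and one rename per file against the temp name; how GCS actually accounts for a rename is not stated in this issue.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the behaviour described in this issue, not DistCp code.
class SharedTempTargetSketch {

    /** Counts mutating operations per object name for one mapper's batch. */
    static Map<String, Integer> mutationsPerObject(
            String targetDir, String attemptId, List<String> files) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        // One temp name per task attempt, reused for every file in the batch.
        String temp = targetDir + "/.distcp.tmp." + attemptId;
        for (String file : files) {
            counts.merge(temp, 1, Integer::sum);                   // create/overwrite temp
            counts.merge(temp, 1, Integer::sum);                   // rename temp away
            counts.merge(targetDir + "/" + file, 1, Integer::sum); // final target appears
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = mutationsPerObject(
                "app/folder",
                "attempt_local1083459072_0001_m_000000_0",
                List.of("a", "b", "c"));
        counts.forEach((name, n) -> System.out.println(n + "  " + name));
    }
}
```

      In this model the operation count on the temp name grows with the batch size, while each final target is touched once, which matches the rate-limit message naming the temp object rather than any target file.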

      It is possible to tune the number of maps, the split size, and so on, but it is hard to derive suitable values for those settings from a rate limit.

      Thus, we propose adding a flag that allows the DistCp mapper to use a different temp file per file.
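
      One way the proposal could look is sketched below. This is only an illustration under assumptions: the `perFile` switch and the per-file suffix (a running index within the mapper's batch) are hypothetical, and no such DistCp option is confirmed by this issue.

```java
// Hypothetical sketch of per-file temp naming; not an existing DistCp API.
class PerFileTempTargetSketch {

    /**
     * Builds the temp file name for one copy.
     *
     * @param attemptId task attempt ID, as seen in the temp names in the log
     * @param perFile   hypothetical flag: when true, each file gets its own temp name
     * @param fileIndex position of the file in the mapper's batch (assumed suffix)
     */
    static String tempTargetName(String attemptId, boolean perFile, long fileIndex) {
        String base = ".distcp.tmp." + attemptId;
        return perFile ? base + "." + fileIndex : base;
    }

    public static void main(String[] args) {
        String attempt = "attempt_local1083459072_0001_m_000000_0";
        System.out.println(tempTargetName(attempt, false, 7)); // shared name
        System.out.println(tempTargetName(attempt, true, 7));  // per-file name
    }
}
```

      With the flag off the name is the same for every file, as today; with it on, each file in the batch writes to a distinct object, so no single object name accumulates operations across the batch.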

      Thoughts? (cc Steve Loughran, Benoy Antony)

              People

              • Assignee: Unassigned
              • Reporter: Arun Suresh (asuresh)
              • Votes: 0
              • Watchers: 4
