Hadoop Common / HADOOP-14828

RetryUpToMaximumTimeWithFixedSleep is not bounded by maximum time

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      In RetryPolicies.java, RetryUpToMaximumTimeWithFixedSleep is implemented as a RetryUpToMaximumCountWithFixedSleep whose retry count is maxTime / sleepTime:

          public RetryUpToMaximumTimeWithFixedSleep(long maxTime, long sleepTime,
              TimeUnit timeUnit) {
            super((int) (maxTime / sleepTime), sleepTime, timeUnit);
            this.maxTime = maxTime;
            this.timeUnit = timeUnit;
          }
      

      This count accounts only for the time spent sleeping between retries. If each attempt itself takes a long time, the total elapsed time exceeds the maxTime passed to RetryUpToMaximumTimeWithFixedSleep.
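
      To illustrate the point, here is a minimal simulation, not Hadoop code. The class and method names are hypothetical, the clock is virtual (nothing actually sleeps), and the model is simplified: it charges one failed attempt plus one sleep per retry and ignores the final failing attempt.

```java
import java.util.concurrent.TimeUnit;

public class CountBoundedRetryDemo {

    /**
     * Returns the simulated elapsed time in milliseconds for a retry loop
     * whose retry count is derived as maxTime / sleepTime, the same way the
     * constructor quoted above derives it.
     */
    static long simulateElapsedMillis(long maxTimeMillis, long sleepMillis,
                                      long attemptMillis) {
        int maxRetries = (int) (maxTimeMillis / sleepMillis);
        long clock = 0; // virtual clock, in milliseconds
        for (int retry = 0; retry < maxRetries; retry++) {
            clock += attemptMillis; // time spent in the failed attempt
            clock += sleepMillis;   // fixed sleep before the next retry
        }
        return clock;
    }

    public static void main(String[] args) {
        long maxTime = TimeUnit.MINUTES.toMillis(1);
        long sleep = TimeUnit.SECONDS.toMillis(10);

        // Instant failures: the budget is respected.
        System.out.println(simulateElapsedMillis(maxTime, sleep, 0));

        // Each attempt takes 5 seconds to fail: the budget is exceeded.
        System.out.println(simulateElapsedMillis(
            maxTime, sleep, TimeUnit.SECONDS.toMillis(5)));
    }
}
```

      With a 1 minute budget and a 10 second sleep, the simulated loop finishes in 60000 ms when attempts fail instantly, but in 90000 ms when each attempt takes 5 seconds to fail.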

      As an example, while restarting NMs we saw a case where NMProxy creates a retry policy with a maximum wait time of 15 minutes and a 10 sec interval, which becomes a MaximumCount policy with 15 min / 10 sec = 90 retries. But each NMProxy attempt in turn goes through o.a.h.ipc.Client's own connection retry policy:

            if (connectionRetryPolicy == null) {
              final int max = conf.getInt(
                  CommonConfigurationKeysPublic.IPC_CLIENT_CONNECT_MAX_RETRIES_KEY,
                  CommonConfigurationKeysPublic.IPC_CLIENT_CONNECT_MAX_RETRIES_DEFAULT);
              final int retryInterval = conf.getInt(
                  CommonConfigurationKeysPublic.IPC_CLIENT_CONNECT_RETRY_INTERVAL_KEY,
                  CommonConfigurationKeysPublic
                      .IPC_CLIENT_CONNECT_RETRY_INTERVAL_DEFAULT);
      
              connectionRetryPolicy = RetryPolicies.retryUpToMaximumCountWithFixedSleep(
                  max, retryInterval, TimeUnit.MILLISECONDS);
            }

      So the time it takes NMProxy to fail is actually (90 retries) * (10 sec NMProxy interval + o.a.h.ipc.Client retry time). By default the ipc client retries 10 times with a 1 sec interval, i.e. about 10 sec per NMProxy attempt, so NMProxy takes (90)(10 sec + 10 sec) = 1800 sec = 30 min to fail instead of the 15 min specified by the NMProxy configuration.
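
      The arithmetic above can be checked with a short sketch. This is not Hadoop code: the class and method names are hypothetical, and it follows the same approximation as the description, counting only sleep intervals and treating each failed connection attempt as instantaneous.

```java
import java.time.Duration;

public class NmProxyWorstCase {

    /**
     * Approximate time for the outer policy to give up when every outer
     * attempt first exhausts the inner ipc connection retries.
     */
    static Duration worstCase(Duration maxWait, Duration outerSleep,
                              int ipcMaxRetries, Duration ipcInterval) {
        // Outer retry count, derived as maxTime / sleepTime.
        long outerRetries = maxWait.toMillis() / outerSleep.toMillis();
        // Time the inner ipc client spends retrying per outer attempt.
        Duration ipcRetryTime = ipcInterval.multipliedBy(ipcMaxRetries);
        // Each outer retry costs one outer sleep plus the inner retry time.
        return outerSleep.plus(ipcRetryTime).multipliedBy(outerRetries);
    }

    public static void main(String[] args) {
        Duration total = worstCase(
            Duration.ofMinutes(15),  // NMProxy maximum wait time
            Duration.ofSeconds(10),  // NMProxy retry interval
            10,                      // ipc client retries (default case above)
            Duration.ofSeconds(1));  // ipc client retry interval
        System.out.println(total.toMinutes() + " min"); // prints "30 min"
    }
}
```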

              People

              • Assignee: Unassigned
              • Reporter: Jonathan Hung (jhung)
              • Votes: 0
              • Watchers: 4
