Hadoop Common / HADOOP-14828

RetryUpToMaximumTimeWithFixedSleep is not bounded by maximum time


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate

    Description

      In RetryPolicies.java, the RetryUpToMaximumTimeWithFixedSleep constructor delegates to RetryUpToMaximumCountWithFixedSleep, passing maxTime / sleepTime as the retry count:

          public RetryUpToMaximumTimeWithFixedSleep(long maxTime, long sleepTime,
              TimeUnit timeUnit) {
            super((int) (maxTime / sleepTime), sleepTime, timeUnit);
            this.maxTime = maxTime;
            this.timeUnit = timeUnit;
          }
      

      This count only bounds the total time spent sleeping between retries. If each retry attempt itself takes a long time, the total elapsed time exceeds the maxTime passed to RetryUpToMaximumTimeWithFixedSleep.
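
      The gap between a count bound and a time bound can be sketched with a simulated clock. This is a minimal illustration, not Hadoop code: the class RetryBoundSketch, both methods, and the attemptCost parameter (how long one failed attempt takes) are hypothetical names introduced here, and the deadline-based variant is only one possible way to bound by elapsed time, not the fix adopted for this issue. All values share one time unit, and the model counts each retry as one failed attempt plus one fixed sleep.

```java
/** Hypothetical sketch; not part of Hadoop. All values share one time unit. */
final class RetryBoundSketch {

  /**
   * Simulated time until a count-bounded policy gives up. The retry count is
   * derived as maxTime / sleepTime, the same arithmetic as the constructor,
   * so the time spent inside each failed attempt is never accounted for.
   */
  static long countBoundedElapsed(long maxTime, long sleepTime, long attemptCost) {
    if (sleepTime <= 0 || attemptCost < 0) {
      throw new IllegalArgumentException("sleepTime must be > 0, attemptCost >= 0");
    }
    int maxRetries = (int) (maxTime / sleepTime);
    long elapsed = 0;
    for (int retries = 0; retries < maxRetries; retries++) {
      elapsed += attemptCost; // the attempt fails after attemptCost
      elapsed += sleepTime;   // fixed sleep before the next attempt
    }
    return elapsed;
  }

  /**
   * Simulated time until a deadline-bounded policy gives up. It checks the
   * clock instead of a counter, so it overshoots maxTime by less than one
   * attempt or one sleep.
   */
  static long deadlineBoundedElapsed(long maxTime, long sleepTime, long attemptCost) {
    if (sleepTime <= 0 || attemptCost < 0) {
      throw new IllegalArgumentException("sleepTime must be > 0, attemptCost >= 0");
    }
    long elapsed = 0;
    while (elapsed < maxTime) {
      elapsed += attemptCost;
      if (elapsed >= maxTime) {
        break; // deadline passed during the attempt; do not sleep again
      }
      elapsed += sleepTime;
    }
    return elapsed;
  }
}
```

      With maxTime = 900 and sleepTime = 10 (seconds), countBoundedElapsed returns 900 when attempts fail instantly but 1800 when each attempt takes 10 seconds, while deadlineBoundedElapsed returns 900 in both cases.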

      As an example, while doing NM restarts, we saw an issue where the NMProxy creates a retry policy which specifies a maximum wait time of 15 minutes and a 10 sec interval (which is converted to a MaximumCount policy with 15 min / 10 sec = 90 tries). But each NMProxy retry attempt in turn goes through o.a.h.ipc.Client's own connection retry policy:

            if (connectionRetryPolicy == null) {
              final int max = conf.getInt(
                  CommonConfigurationKeysPublic.IPC_CLIENT_CONNECT_MAX_RETRIES_KEY,
                  CommonConfigurationKeysPublic.IPC_CLIENT_CONNECT_MAX_RETRIES_DEFAULT);
              final int retryInterval = conf.getInt(
                  CommonConfigurationKeysPublic.IPC_CLIENT_CONNECT_RETRY_INTERVAL_KEY,
                  CommonConfigurationKeysPublic
                      .IPC_CLIENT_CONNECT_RETRY_INTERVAL_DEFAULT);
      
              connectionRetryPolicy = RetryPolicies.retryUpToMaximumCountWithFixedSleep(
                  max, retryInterval, TimeUnit.MILLISECONDS);
            }

      So the time it takes the NMProxy to fail is actually (90 retries) * (10 sec NMProxy interval + o.a.h.ipc.Client retry time). In the default case, the ipc client retries 10 times with a 1 sec interval, which adds about 10 sec to each NMProxy attempt. The time it takes for NMProxy to fail is therefore (90)(10 sec + 10 sec) = 30 min instead of the 15 min specified by the NMProxy configuration.
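
      The arithmetic above can be reproduced with java.time.Duration. This is a sketch under the same simplifications as the description (connection attempts themselves are treated as instantaneous, so only sleep intervals are counted); the class NmProxyWorstCase and its method are hypothetical names, not Hadoop APIs.

```java
import java.time.Duration;

/** Hypothetical helper reproducing the estimate from the description. */
final class NmProxyWorstCase {

  /**
   * Estimated time until the outer (NMProxy) policy gives up when every
   * outer attempt also exhausts the inner (ipc client) retry policy.
   */
  static Duration worstCase(Duration maxWait, Duration outerSleep,
      int ipcRetries, Duration ipcSleep) {
    // Count derived the same way as the constructor: maxTime / sleepTime.
    long outerRetries = maxWait.toMillis() / outerSleep.toMillis();
    // Time one outer attempt spends inside the ipc client's retries.
    Duration ipcTimePerAttempt = ipcSleep.multipliedBy(ipcRetries);
    // Each outer retry costs its own sleep plus the inner retry time.
    return outerSleep.plus(ipcTimePerAttempt).multipliedBy(outerRetries);
  }
}
```

      Calling worstCase(Duration.ofMinutes(15), Duration.ofSeconds(10), 10, Duration.ofSeconds(1)) yields a Duration of 30 minutes, double the configured 15 minute maximum.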

            People

              Assignee: Unassigned
              Reporter: Jonathan Hung (jhung)