  1. Hadoop YARN
  2. YARN-11355

YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3

Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.4.0
    • Fix Version/s: None
    • Component/s: client
    • Labels: None

    Description

      During the initial retry, the YARN client fails over to rm2 immediately but takes ~30000 ms to fail over to rm3.

      Repro:

      1. Set up a YARN cluster with three master nodes: rm1, rm2, and rm3.
      2. Make rm3 the active ResourceManager.
      3. Run yarn node -list or any other YARN client call. The call takes more than 30 seconds.
       

      The initial failover to rm2 is immediate, but the next failover to rm3 happens only after ~30000 ms. The current RetryPolicy does not take the number of master nodes into account. It should perform at least one immediate failover to every RM before it starts sleeping between attempts.
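
The requested behavior can be illustrated with a minimal sketch. This is not the attached patch and not Hadoop's RetryPolicy API: the class FailoverSleepSketch, the method sleepBeforeFailoverMs, and its parameters are hypothetical names chosen for illustration. It assumes the client starts at one RM and should visit each remaining RM once without sleeping, applying the retry interval only after every RM has been tried.

```java
// Hypothetical sketch: none of these names exist in Hadoop.
final class FailoverSleepSketch {

    /**
     * Returns how long to sleep before the next failover.
     *
     * @param failoversSoFar  number of failovers already performed
     * @param rmCount         number of configured ResourceManagers
     * @param retryIntervalMs interval to wait once every RM has been tried
     */
    static long sleepBeforeFailoverMs(int failoversSoFar, int rmCount, long retryIntervalMs) {
        // The first (rmCount - 1) failovers reach each remaining RM once,
        // so they happen immediately.
        if (failoversSoFar < rmCount - 1) {
            return 0L;
        }
        // Every RM has been tried at least once: back off before cycling again.
        return retryIntervalMs;
    }

    public static void main(String[] args) {
        int rmCount = 3;               // rm1, rm2, rm3
        long retryIntervalMs = 30_000L; // value quoted in this issue
        for (int failovers = 0; failovers < 4; failovers++) {
            System.out.println("failovers=" + failovers
                + " sleepMs=" + sleepBeforeFailoverMs(failovers, rmCount, retryIntervalMs));
        }
    }
}
```

With three RMs, the sketch yields no sleep before the failovers to rm2 and rm3, and the configured interval only before cycling back to rm1.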

      2022-10-20 06:37:44,123 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
      2022-10-20 06:37:44,129 INFO retry.RetryInvocationHandler: java.net.ConnectException: Call From local to remote:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ApplicationClientProtocolPBClientImpl.getClusterNodes over rm2 after 1 failover attempts. Trying to failover after sleeping for 21139ms.
      

       

      Workaround:

      Reduce yarn.resourcemanager.connect.retry-interval.ms from 30000 to a small value such as 100. The client then fails over to rm3 almost immediately, but it also retries far too often when no ResourceManager is active.
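
The cost of the workaround can be estimated with a rough sketch. It rests on two assumptions that are not stated in this issue: that the client derives its failover attempt budget by dividing the maximum connect wait by the retry interval, and that the maximum wait is 900000 ms. The class RetryBudgetSketch and its method are hypothetical names.

```java
// Hypothetical sketch: names and the budget formula are assumptions.
final class RetryBudgetSketch {

    /**
     * Approximates how many failover attempts fit into the wait budget,
     * assuming attempts = maxWaitMs / retryIntervalMs (integer division).
     */
    static long approxFailoverAttempts(long maxWaitMs, long retryIntervalMs) {
        if (retryIntervalMs <= 0) {
            throw new IllegalArgumentException("retryIntervalMs must be positive");
        }
        return maxWaitMs / retryIntervalMs;
    }

    public static void main(String[] args) {
        long maxWaitMs = 900_000L; // assumed maximum connect wait
        System.out.println("interval=30000 -> "
            + approxFailoverAttempts(maxWaitMs, 30_000L) + " attempts");
        System.out.println("interval=100   -> "
            + approxFailoverAttempts(maxWaitMs, 100L) + " attempts");
    }
}
```

Under these assumptions the attempt budget grows from 30 to 9000 when no ResourceManager is active, which is the retry overhead the workaround trades for a faster failover to rm3.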
       

       

      Attachments

        1. YARN-11355.diff
          4 kB
          Vineeth Naroju

          People

            Assignee: Vineeth Naroju
            Reporter: Prabhu Joseph
            Votes: 0
            Watchers: 3
