  1. Hadoop YARN
  2. YARN-10789

RM HA startup can fail due to race conditions in ZKConfigurationStore

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.4.0, 3.2.3, 3.3.2
    • Component/s: capacity scheduler
    • Labels: None

    Description

      We intermittently see the error below during Hadoop installation and initial RM startup when HA is enabled and yarn.scheduler.configuration.store.class=zk is configured. As a result, one of the RMs fails to start.

      2021-05-26 12:59:18,986 INFO org.apache.hadoop.service.AbstractService: Service RMActiveServices failed in state INITED
      org.apache.hadoop.service.ServiceStateException: java.io.IOException: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /confstore/CONF_STORE
      

      ZKConfigurationStore tries to create the znode /confstore/CONF_STORE when it is initialized. The problem is that ZKConfigurationStore is initialized during CapacityScheduler's serviceInit, which runs on both the Active and the Standby RM. When both RMs are started at the same time, both can try to create the same znode, and the one that loses the race fails with the NodeExistsException shown above.
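
      The race can be reproduced without a ZooKeeper cluster. The sketch below is a simulation only: FakeZk is a hypothetical in-memory stand-in, not the ZooKeeper or Curator API, and the check-then-create shape of racyInit is an assumption about how the initialization might look, not a copy of the ZKConfigurationStore code. A latch forces both simulated RMs past the existence check before either creates the znode, so exactly one of them fails.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical in-memory stand-in for a ZooKeeper namespace (not the real client API).
class FakeZk {
    static class NodeExistsException extends Exception {
        NodeExistsException(String path) { super("NodeExists for " + path); }
    }

    private final ConcurrentHashMap<String, byte[]> nodes = new ConcurrentHashMap<>();

    boolean exists(String path) { return nodes.containsKey(path); }

    // Fails if the node is already present, like a plain znode create.
    void create(String path, byte[] data) throws NodeExistsException {
        if (nodes.putIfAbsent(path, data) != null) {
            throw new NodeExistsException(path);
        }
    }
}

class ConfStoreRaceDemo {
    static final String PATH = "/confstore/CONF_STORE";

    // Assumed check-then-create init: unsafe when two RMs run it concurrently.
    static void racyInit(FakeZk zk, CountDownLatch bothChecked) throws Exception {
        if (!zk.exists(PATH)) {
            bothChecked.countDown();
            bothChecked.await(); // both RMs have seen "absent" before either creates
            zk.create(PATH, new byte[0]);
        }
    }

    // Runs two simulated RMs at once; returns how many failed with NodeExists.
    static int runRacyInit() {
        FakeZk zk = new FakeZk();
        CountDownLatch bothChecked = new CountDownLatch(2);
        AtomicInteger failures = new AtomicInteger();
        Runnable rmInit = () -> {
            try {
                racyInit(zk, bothChecked);
            } catch (FakeZk.NodeExistsException e) {
                failures.incrementAndGet();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        };
        Thread rm1 = new Thread(rmInit, "rm1");
        Thread rm2 = new Thread(rmInit, "rm2");
        rm1.start();
        rm2.start();
        try {
            rm1.join();
            rm2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return failures.get();
    }

    public static void main(String[] args) {
        System.out.println("RMs that failed init: " + runRacyInit());
    }
}
```

      In a real deployment the interleaving depends on timing, which is consistent with the failure appearing only intermittently.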

      ZKRMStateStore, on the other hand, avoids this race by creating its znodes only after serviceStart. serviceStart runs only on the active RM that won the leader election, whereas serviceInit runs on every RM regardless of the election outcome.
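
      The lifecycle difference can be sketched as follows. This is a minimal illustration under stated assumptions, not the ZKRMStateStore implementation or the attached patch: Store, SharedNamespace, and the path /store/ROOT are hypothetical names, and leader election is reduced to picking one instance by hand. Both simulated RMs run serviceInit, but only the elected one runs serviceStart, so the create is attempted once.

```java
import java.util.HashSet;
import java.util.Set;

class LifecycleDemo {
    // Hypothetical shared namespace standing in for ZooKeeper.
    static class SharedNamespace {
        final Set<String> znodes = new HashSet<>();
        int createAttempts = 0;

        void create(String path) {
            createAttempts++;
            if (!znodes.add(path)) {
                throw new IllegalStateException("NodeExists for " + path);
            }
        }
    }

    // Hypothetical store that defers znode creation to serviceStart.
    static class Store {
        private final SharedNamespace zk;
        private boolean initialized = false;

        Store(SharedNamespace zk) { this.zk = zk; }

        // Runs on every RM: only local setup, nothing written to the shared namespace.
        void serviceInit() { initialized = true; }

        // Runs only on the RM that won the election: safe place to create znodes.
        void serviceStart() {
            if (!initialized) {
                throw new IllegalStateException("serviceInit must run first");
            }
            zk.create("/store/ROOT");
        }
    }

    // Returns the number of create attempts made while bringing up two RMs.
    static int startCluster() {
        SharedNamespace zk = new SharedNamespace();
        Store rm1 = new Store(zk);
        Store rm2 = new Store(zk);
        rm1.serviceInit();
        rm2.serviceInit();
        Store leader = rm1; // assumption: rm1 won the leader election
        leader.serviceStart();
        return zk.createAttempts;
    }

    public static void main(String[] args) {
        System.out.println("create attempts: " + startCluster());
    }
}
```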

      Attachments

        1. YARN-10789.branch-3.3.001.patch
          3 kB
          Tarun Parimi
        2. YARN-10789.branch-3.2.001.patch
          3 kB
          Tarun Parimi
        3. YARN-10789.002.patch
          3 kB
          Tarun Parimi
        4. YARN-10789.001.patch
          3 kB
          Tarun Parimi

            People

              Assignee: Tarun Parimi
              Reporter: Tarun Parimi
              Votes: 0
              Watchers: 8
