HBase / HBASE-14498

Master stuck in infinite loop when all Zookeeper servers are unreachable (and RS may run after losing its znode)

Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Blocker
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0-alpha-1, 1.5.0, 2.0.0, 2.2.0
    • Fix Version/s: 3.0.0-beta-2
    • Component/s: master
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      We hit a strange scenario in our production environment.
      In an HA cluster:
      > The active master (HM1) could not connect to any ZooKeeper server, because the network between the master machine and the ZooKeeper servers had broken down.

      2015-09-26 15:24:47,508 INFO [HM1-Host:16000.activeMasterManager-SendThread(ZK-Host:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 33463ms for sessionid 0x104576b8dda0002, closing socket connection and attempting reconnect
      2015-09-26 15:24:47,877 INFO [HM1-Host:16000.activeMasterManager-SendThread(ZK-Host1:2181)] client.FourLetterWordMain: connecting to ZK-Host1 2181
      2015-09-26 15:24:48,236 INFO [main-SendThread(ZK-Host1:2181)] client.FourLetterWordMain: connecting to ZK-Host1 2181
      2015-09-26 15:24:49,879 WARN [HM1-Host:16000.activeMasterManager-SendThread(ZK-Host1:2181)] zookeeper.ClientCnxn: Can not get the principle name from server ZK-Host1
      2015-09-26 15:24:49,879 INFO [HM1-Host:16000.activeMasterManager-SendThread(ZK-Host1:2181)] zookeeper.ClientCnxn: Opening socket connection to server ZK-Host1/ZK-IP1:2181. Will not attempt to authenticate using SASL (unknown error)
      2015-09-26 15:24:50,238 WARN [main-SendThread(ZK-Host1:2181)] zookeeper.ClientCnxn: Can not get the principle name from server ZK-Host1
      2015-09-26 15:24:50,238 INFO [main-SendThread(ZK-Host1:2181)] zookeeper.ClientCnxn: Opening socket connection to server ZK-Host1/ZK-Host1:2181. Will not attempt to authenticate using SASL (unknown error)
      2015-09-26 15:25:17,470 INFO [main-SendThread(ZK-Host1:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 30023ms for sessionid 0x2045762cc710006, closing socket connection and attempting reconnect
      2015-09-26 15:25:17,571 WARN [master/HM1-Host/HM1-IP:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ZK-Host:2181,ZK-Host1:2181,ZK-Host2:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
      2015-09-26 15:25:17,872 INFO [main-SendThread(ZK-Host:2181)] client.FourLetterWordMain: connecting to ZK-Host 2181
      2015-09-26 15:25:19,874 WARN [main-SendThread(ZK-Host:2181)] zookeeper.ClientCnxn: Can not get the principle name from server ZK-Host
      2015-09-26 15:25:19,874 INFO [main-SendThread(ZK-Host:2181)] zookeeper.ClientCnxn: Opening socket connection to server ZK-Host/ZK-IP:2181. Will not attempt to authenticate using SASL (unknown error)
      

      > Because HM1 could not reach any ZooKeeper server, it was never notified that its session had expired, so HM1 did not abort.
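
      To make the failure mode concrete, here is a minimal sketch of a client-side timer that would let a master decide to abort on its own once it has been disconnected for too long. It is only an illustration under stated assumptions: `ZkDisconnectWatchdog`, `maxDisconnectedMs` and the method names are hypothetical, not HBase or ZooKeeper API, and this is not necessarily what the attached patches do.

```java
// Hypothetical sketch: not HBase or ZooKeeper API.
// Tracks how long the client has been disconnected and reports when a
// caller-chosen limit has been exceeded.
final class ZkDisconnectWatchdog {
    private final long maxDisconnectedMs;
    // -1 means "currently connected"; otherwise the time of the first disconnect.
    private long disconnectedSinceMs = -1;

    ZkDisconnectWatchdog(long maxDisconnectedMs) {
        this.maxDisconnectedMs = maxDisconnectedMs;
    }

    // Called when the client observes a connection loss.
    // Repeated calls during one outage keep the original start time.
    void onDisconnected(long nowMs) {
        if (disconnectedSinceMs < 0) {
            disconnectedSinceMs = nowMs;
        }
    }

    // Called when the client reconnects; clears the outage.
    void onConnected() {
        disconnectedSinceMs = -1;
    }

    // True once the current outage has lasted at least maxDisconnectedMs.
    boolean shouldAbort(long nowMs) {
        return disconnectedSinceMs >= 0
            && nowMs - disconnectedSinceMs >= maxDisconnectedMs;
    }
}
```

      In the scenario above, HM1 had no equivalent of `shouldAbort` turning true, so it kept running while it retried the connection.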

      > When HM1's ZooKeeper session timed out, the standby master (HM2) registered itself as the active master.
      > HM2 keeps waiting for region servers to report to it as part of active master initialization.

       
      2015-09-26 15:24:44,928 | INFO | HM2-Host:21300.activeMasterManager | Waiting for region servers count to settle; currently checked in 0, slept for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms. | org.apache.hadoop.hbase.master.ServerManager.waitForRegionServers(ServerManager.java:1011)
      ---
      ---
      2015-09-26 15:32:50,841 | INFO | HM2-Host:21300.activeMasterManager | Waiting for region servers count to settle; currently checked in 0, slept for 483913 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms. | org.apache.hadoop.hbase.master.ServerManager.waitForRegionServers(ServerManager.java:1011)
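
      The second log line shows HM2 still waiting after 483913 ms even though the timeout is 4500 ms. A simplified model of such a wait condition, inferred from the fields in the log message, is sketched below. `RegionServerWaitModel` and `keepWaiting` are hypothetical names, and the real `ServerManager.waitForRegionServers` may differ in detail.

```java
// Hypothetical model inferred from the log fields; not the HBase source.
final class RegionServerWaitModel {
    // Returns true if the master should keep waiting for region servers.
    static boolean keepWaiting(int checkedIn, int minToStart, int maxToStart,
                               long sleptMs, long timeoutMs,
                               long msSinceLastCountChange, long intervalMs) {
        if (checkedIn >= maxToStart) {
            return false; // everyone we could expect has checked in
        }
        if (checkedIn < minToStart) {
            return true;  // below the minimum: the timeout is not consulted
        }
        // Minimum reached: wait until the timeout has elapsed and the
        // count has been stable for at least one interval.
        return sleptMs < timeoutMs || msSinceLastCountChange < intervalMs;
    }
}
```

      With the values from the log (0 checked in, minimum of 1), this model keeps waiting no matter how long it has slept, which matches the behaviour shown above.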
      

      > Meanwhile, the region servers keep reporting to HM1 at a 3-second interval. A region server retrieves the master location from ZooKeeper only when it cannot connect to the master (ServiceException).
      Under the current design, region servers will not report to HM2 unless HM1 aborts, so HM2 exits (InitializationMonitor) and then waits for region servers again, in a loop.
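
      The reporting behaviour described above can be modelled in a few lines. This is a hedged sketch, not HBase code: `RegionServerReporterModel`, `reportOnce` and the `masterFromZk` supplier are hypothetical stand-ins, and a master being absent from `reachableMasters` stands in for the ServiceException case.

```java
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical model of the reporting behaviour; not HBase code.
final class RegionServerReporterModel {
    private String cachedMaster;
    // Stand-in for reading the active master's location from ZooKeeper.
    private final Supplier<String> masterFromZk;

    RegionServerReporterModel(String initialMaster, Supplier<String> masterFromZk) {
        this.cachedMaster = initialMaster;
        this.masterFromZk = masterFromZk;
    }

    // One report cycle. Returns the master that received the report,
    // or null if no master could be reached in this cycle.
    String reportOnce(Set<String> reachableMasters) {
        if (reachableMasters.contains(cachedMaster)) {
            return cachedMaster; // report succeeded, ZooKeeper is not consulted
        }
        // The report failed, so look the master up again.
        cachedMaster = masterFromZk.get();
        return reachableMasters.contains(cachedMaster) ? cachedMaster : null;
    }
}
```

      As long as HM1 stays reachable, the model keeps returning HM1 even though ZooKeeper already names HM2 as the active master. It switches to HM2 only after a report to HM1 fails.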

      Attachments

        1. HBASE-14498.009.patch
          12 kB
          Pankaj Kumar
        2. HBASE-14498.009.patch
          12 kB
          Pankaj Kumar
        3. HBASE-14498.008.patch
          12 kB
          Pankaj Kumar
        4. HBASE-14498.007.patch
          12 kB
          Pankaj Kumar
        5. HBASE-14498-branch-1.2.patch
          11 kB
          Pankaj Kumar
        6. HBASE-14498-branch-1.3-V2.patch
          11 kB
          Pankaj Kumar
        7. HBASE-14498-branch-1.3.patch
          11 kB
          Pankaj Kumar
        8. HBASE-14498-branch-1.4.patch
          11 kB
          Pankaj Kumar
        9. HBASE-14498-branch-1.patch
          11 kB
          Pankaj Kumar
        10. HBASE-14498-addendum.patch
          2 kB
          Pankaj Kumar
        11. HBASE-14498.master.002.patch
          11 kB
          Pankaj Kumar
        12. HBASE-14498.master.001.patch
          11 kB
          Michael Stack
        13. HBASE-14498-V6.patch
          11 kB
          Pankaj Kumar
        14. HBASE-14498-V6.patch
          11 kB
          Pankaj Kumar
        15. HBASE-14498-V5.patch
          11 kB
          Pankaj Kumar
        16. HBASE-14498-V4.patch
          7 kB
          Pankaj Kumar
        17. HBASE-14498-V3.patch
          7 kB
          Pankaj Kumar
        18. HBASE-14498-V2.patch
          7 kB
          Pankaj Kumar
        19. HBASE-14498.patch
          4 kB
          Pankaj Kumar

            People

              Assignee: Pankaj Kumar (pankaj2461)
              Reporter: Y. SREENIVASULU REDDY (sreenivasulureddy)
              Votes: 0
              Watchers: 20
