HBase / HBASE-12743

[ITBLL] Master fails rejoining cluster stuck splitting logs; Distributed log replay=true

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Abandoned
    • Affects Version/s: None
    • Fix Version/s: 1.1.0, 1.0.2, 2.0.0
    • Component/s: None
    • Labels:
      None

      Description

      Master has been stuck for two days trying to rejoin the cluster after the chaos monkey killed and restarted it.

      After 350 failed attempts to read the namespace table, the Master goes down:

      2014-12-20 18:43:54,285 INFO  [c2020:16020.activeMasterManager] client.RpcRetryingCaller: Call exception, tries=349, retries=350, started=6885331 ms ago, cancelled=false, msg=row 'default' on table 'hbase:namespace' at region=hbase:namespace,,1417551886199.ecdcd0172cd3e32d291bc282771895da., hostname=c2023.halxg.cloudera.com,16020,1418988286696, seqNum=6000000190
      2014-12-20 18:43:54,303 WARN  [c2020:16020.activeMasterManager] master.TableNamespaceManager: Caught exception in initializing namespace table manager
      org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=350, exceptions:
      Sat Dec 20 16:49:08 PST 2014, RpcRetryingCaller{globalStartTime=1419122948954, pause=100, retries=350}, org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region hbase:namespace,,1417551886199.ecdcd0172cd3e32d291bc282771895da. is not online on c2023.halxg.cloudera.com,16020,1418988286696
              at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2722)
              at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:851)
              at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:1695)
              at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:30434)
      

      Seems like distributed log replay is on:

      2014-12-20 16:49:03,665 INFO [RS_LOG_REPLAY_OPS-c2021:16020-0] wal.WALSplitter: DistributedLogReplay = true

      Seems easy enough to reproduce.

        Attachments

        1. 12743.hack.txt
          28 kB
          Michael Stack

            People

            • Assignee:
              Unassigned
            • Reporter:
              Michael Stack (stack)
            • Votes:
              0
            • Watchers:
              8
