HBASE-21316: All RegionServers go down during RS_LOG_REPLAY_OPS


Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: regionserver

    Description

      1. One RegionServer died for an unknown reason. Its log:

      2018-10-14 20:31:47,423 INFO [main-SendThread(11.3.20.101:2181)] zookeeper.ClientCnxn: Socket connection established to 11.3.20.101/11.3.20.101:2181, initiating session
      2018-10-14 20:31:47,433 INFO [main-SendThread(11.3.20.101:2181)] zookeeper.ClientCnxn: Session establishment complete on server 11.3.20.101/11.3.20.101:2181, sessionid = 0x6500073f944a8e79, negotiated timeout = 30000
      2018-10-14 Sunday 21:03:05 CST Starting regionserver on 11-3-19-199.JD.LOCAL
      core file size (blocks, -c) 0
      data seg size (kbytes, -d) unlimited
      

      2. The Master received the ZooKeeper node-deleted event and started a ServerCrashProcedure:

      2018-10-14 20:31:47,437 INFO [main-EventThread] master.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [11-3-19-199.jd.local,16020,1539492869470]
      
      2018-10-14 20:31:47,539 INFO [PEWorker-1] procedure.ServerCrashProcedure: Start pid=25053, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure server=11-3-19-199.jd.local,16020,1539492869470, splitWal=true, meta=false
      
      2018-10-14 20:31:47,550 INFO [PEWorker-1] master.SplitLogManager: Started splitting 63 logs in [hdfs://11-3-18-67.JD.LOCAL:9000/hbase/WALs/11-3-19-199.jd.local,16020,1539492869470-splitting] for [11-3-19-199.jd.local,16020,1539492869470]
      ...
      2018-10-14 20:31:48,592 INFO [main-EventThread] coordination.SplitLogManagerCoordination: Task /hbase/splitWAL/WALs%2F11-3-19-199.jd.local%2C16020%2C1539492869470-splitting%2F11-3-19-199.jd.local%252C16020%252C1539492869470.1539520250598 acquired by 11-3-18-71.jd.local,16020,1539492869409
      
      

      3. A live RegionServer acquired the split-log task in its SplitLogWorker, hit an error, and stopped:

      2018-10-14 20:31:48,602 INFO [SplitLogWorker-11-3-18-71:16020] coordination.ZkSplitLogWorkerCoordination: worker 11-3-18-71.jd.local,16020,1539492869409 acquired task /hbase/splitWAL/WALs%2F11-3-19-199.jd.local%2C16020%2C1539492869470-splitting%2F11-3-19-199.jd.local%252C16020%252C1539492869470.1539520250598 
      ...
      2018-10-14 21:03:26,219 ERROR [RS_LOG_REPLAY_OPS-regionserver/11-3-18-71:16020-1] executor.EventHandler: Caught throwable while processing event RS_LOG_REPLAY
      java.lang.ArrayIndexOutOfBoundsException: 8811
      at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1365)
      at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1358)
      at org.apache.hadoop.hbase.PrivateCellUtil.matchingFamily(PrivateCellUtil.java:735)
      at org.apache.hadoop.hbase.CellUtil.matchingFamily(CellUtil.java:816)
      at org.apache.hadoop.hbase.wal.WALEdit.isMetaEditFamily(WALEdit.java:102)
      at org.apache.hadoop.hbase.wal.WALEdit.isMetaEdit(WALEdit.java:107)
      at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:296)
      at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:194)
      at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:99)
      at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:70)
      at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)
      
      2018-10-14 21:03:26,227 ERROR [RS_LOG_REPLAY_OPS-regionserver/11-3-18-71:16020-1] regionserver.HRegionServer: ***** ABORTING region server 11-3-18-71.jd.local,16020,1539522186368: Caught throwable while processing event RS_LOG_REPLAY *****
      ....
      2018-10-14 20:31:48,780 INFO [RS_LOG_REPLAY_OPS-regionserver/11-3-18-71:16020-0] regionserver.HRegionServer: ***** STOPPING region server '11-3-18-71.jd.local,16020,1539492869409' *****
      
      

      4. The remaining live RegionServers died one by one until every RegionServer was down.
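
The trace in step 3 fails inside KeyValue.getFamilyLength while WALSplitter is evaluating WALEdit.isMetaEdit. One plausible reading, not confirmed by the logs above, is that a cell read from the WAL carried length fields pointing past the end of its backing array. The sketch below illustrates that failure mode and a defensive bounds check. It assumes the standard serialized KeyValue layout (4-byte key length, 4-byte value length, 2-byte row length, row, 1-byte family length, family, qualifier, 8-byte timestamp, 1-byte type). The class and method names are hypothetical and are not HBase API, and this is not the patch attached to the issue.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.OptionalInt;

class FamilyLengthCheck {
    // Key length (int) + value length (int) precede the row-length short.
    static final int ROW_OFFSET = 8;

    /** Hypothetical helper: serializes a cell with an empty qualifier and empty value. */
    static byte[] encode(byte[] row, byte[] family) {
        int keyLen = 2 + row.length + 1 + family.length + 8 + 1;
        ByteBuffer b = ByteBuffer.allocate(ROW_OFFSET + keyLen);
        b.putInt(keyLen).putInt(0);                // key length, value length
        b.putShort((short) row.length).put(row);   // row length, row
        b.put((byte) family.length).put(family);   // family length, family
        b.putLong(0L).put((byte) 4);               // timestamp, type code 4 (Put)
        return b.array();
    }

    /** Returns the family length, or empty if the cell's offsets fall outside buf. */
    static OptionalInt safeFamilyLength(byte[] buf, int offset) {
        int rowLenPos = offset + ROW_OFFSET;
        if (offset < 0 || rowLenPos + 2 > buf.length) {
            return OptionalInt.empty();
        }
        // Row length is a signed big-endian short.
        int rowLen = (short) (((buf[rowLenPos] & 0xff) << 8) | (buf[rowLenPos + 1] & 0xff));
        if (rowLen < 0) {
            return OptionalInt.empty();
        }
        int famLenPos = rowLenPos + 2 + rowLen;
        if (famLenPos >= buf.length) {
            // An unchecked buf[famLenPos] here would throw ArrayIndexOutOfBoundsException.
            return OptionalInt.empty();
        }
        return OptionalInt.of(buf[famLenPos]);
    }

    public static void main(String[] args) {
        byte[] good = encode("r1".getBytes(StandardCharsets.UTF_8),
                             "cf".getBytes(StandardCharsets.UTF_8));
        System.out.println(safeFamilyLength(good, 0));

        // Simulate corruption: overwrite the row length with a value far larger than the buffer.
        byte[] bad = good.clone();
        bad[ROW_OFFSET] = 0x7f;
        bad[ROW_OFFSET + 1] = 0x00;
        System.out.println(safeFamilyLength(bad, 0));
    }
}
```

With a check of this kind, a reader could report the malformed cell and skip or fail the single split task, rather than letting an unchecked exception reach the event handler, which in the log above aborts the whole region server.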

    People

      Assignee: Unassigned
      Reporter: justice (justice_103)