Uploaded image for project: 'HBase'
  1. HBase
  2. HBASE-2707

Can't recover from a dead ROOT server if any exceptions happens during log splitting

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.90.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      There's an almost easy way to get stuck after a RS holding ROOT dies, usually from a GC-like event. It happens frequently to my TestReplication in HBASE-2223.

      Some logs:

      2010-06-10 11:35:52,090 INFO  [master] wal.HLog(1175): Spliting is done. Removing old log dir hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
      2010-06-10 11:35:52,095 WARN  [master] master.RegionServerOperationQueue(183): Failed processing: ProcessServerShutdown of 10.10.1.63,55846,1276194933831; putting onto delayed todo queue
      java.io.IOException: Cannot delete: hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
              at org.apache.hadoop.hbase.regionserver.wal.HLog.splitLog(HLog.java:1179)
              at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:298)
              at org.apache.hadoop.hbase.master.RegionServerOperationQueue.process(RegionServerOperationQueue.java:149)
              at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:456)
      Caused by: java.io.IOException: java.io.IOException: /user/jdcryans/.logs/10.10.1.63,55846,1276194933831 is non empty
      2010-06-10 11:35:52,097 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
      2010-06-10 11:35:53,098 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
      2010-06-10 11:35:53,523 INFO  [main.serverMonitor] master.ServerManager$ServerMonitor(131): 1 region servers, 1 dead, average load 14.0[10.10.1.63,55846,1276194933831]
      2010-06-10 11:35:54,099 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
      2010-06-10 11:35:55,101 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
      

      The last lines are my own debug. Since we don't process the delayed todo if ROOT isn't online, we'll never reassign the regions.

        Attachments

        1. 2707-0.20.txt
          1 kB
          stack
        2. 2707-test.txt
          13 kB
          stack
        3. HBASE-2707.patch
          2 kB
          Jean-Daniel Cryans

          Issue Links

            Activity

              People

              • Assignee:
                stack stack
                Reporter:
                jdcryans Jean-Daniel Cryans
              • Votes:
                0 Vote for this issue
                Watchers:
                1 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: