HBASE-2707

Can't recover from a dead ROOT server if any exception happens during log splitting

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.90.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      There's a fairly easy way for the master to get stuck after the region server (RS) holding -ROOT- dies, usually because of a GC-like pause. It happens frequently in my TestReplication in HBASE-2223.

      Some logs:

      2010-06-10 11:35:52,090 INFO  [master] wal.HLog(1175): Spliting is done. Removing old log dir hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
      2010-06-10 11:35:52,095 WARN  [master] master.RegionServerOperationQueue(183): Failed processing: ProcessServerShutdown of 10.10.1.63,55846,1276194933831; putting onto delayed todo queue
      java.io.IOException: Cannot delete: hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
              at org.apache.hadoop.hbase.regionserver.wal.HLog.splitLog(HLog.java:1179)
              at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:298)
              at org.apache.hadoop.hbase.master.RegionServerOperationQueue.process(RegionServerOperationQueue.java:149)
              at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:456)
      Caused by: java.io.IOException: java.io.IOException: /user/jdcryans/.logs/10.10.1.63,55846,1276194933831 is non empty
      2010-06-10 11:35:52,097 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
      2010-06-10 11:35:53,098 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
      2010-06-10 11:35:53,523 INFO  [main.serverMonitor] master.ServerManager$ServerMonitor(131): 1 region servers, 1 dead, average load 14.0[10.10.1.63,55846,1276194933831]
      2010-06-10 11:35:54,099 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
      2010-06-10 11:35:55,101 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
      

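      The last lines are my own debug output. The ProcessServerShutdown for the dead server fails while removing the old log dir and is put onto the delayed todo queue. Since we don't process the delayed todo queue while -ROOT- is offline, and -ROOT- was on the dead server, that operation is never retried and we'll never reassign the regions.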
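
The stall can be reproduced in a toy model. The sketch below is not the HBase code: the class `DelayedQueueDeadlockSketch`, its `Operation` interface, and the `requireRootForDelayed` flag are hypothetical names made up for illustration. It assumes only what the logs show: a failed operation is parked on a delayed queue, and the delayed queue is skipped while ROOT is offline. The unguarded branch exists purely for contrast and makes no claim about what the attached patches change.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Toy model of the stall; all names here are hypothetical, not HBase classes. */
class DelayedQueueDeadlockSketch {

    /** Stand-in for a master operation such as a server shutdown handler. */
    interface Operation {
        /** Returns true on success, false if it must be retried later. */
        boolean process();
    }

    private final Queue<Operation> toDo = new ArrayDeque<>();
    private final Queue<Operation> delayed = new ArrayDeque<>();
    boolean rootOnline = false; // ROOT lived on the server that just died

    void put(Operation op) {
        toDo.add(op);
    }

    /** One pass of the master loop; returns true if an operation was run. */
    boolean tick(boolean requireRootForDelayed) {
        Operation op = toDo.poll();
        if (op == null) {
            // The guard from the logs: delayed items are skipped while ROOT is offline.
            if (requireRootForDelayed && !rootOnline) {
                return false;
            }
            op = delayed.poll();
        }
        if (op == null) {
            return false;
        }
        if (!op.process()) {
            delayed.add(op); // failed: park it for a later retry
        }
        return true;
    }

    /** Runs the scenario and reports whether ROOT ever comes back online. */
    static boolean simulate(boolean requireRootForDelayed, int ticks) {
        DelayedQueueDeadlockSketch queue = new DelayedQueueDeadlockSketch();
        int[] attempts = {0};
        queue.put(() -> {
            attempts[0]++;
            if (attempts[0] == 1) {
                return false;        // first try: removing the old log dir fails
            }
            queue.rootOnline = true; // retry: shutdown handled, ROOT reassigned
            return true;
        });
        for (int i = 0; i < ticks; i++) {
            queue.tick(requireRootForDelayed);
        }
        return queue.rootOnline;
    }

    public static void main(String[] args) {
        System.out.println("guarded:   ROOT online = " + simulate(true, 10));
        System.out.println("unguarded: ROOT online = " + simulate(false, 10));
    }
}
```

With the guard on, the only operation that could bring ROOT back is the one waiting behind the guard, so the loop spins forever; in this model, any change that lets that operation be retried breaks the cycle.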

      1. 2707-0.20.txt
        1 kB
        stack
      2. 2707-test.txt
        13 kB
        stack
      3. HBASE-2707.patch
        2 kB
        Jean-Daniel Cryans

        Issue Links

          Activity

          Jean-Daniel Cryans created issue
          Jean-Daniel Cryans made changes -
          Field Original Value New Value
          Attachment HBASE-2707.patch [ 12446804 ]
          stack made changes -
          Assignee Jean-Daniel Cryans [ jdcryans ] stack [ stack ]
          Jean-Daniel Cryans made changes -
          Link This issue blocks HBASE-2223 [ HBASE-2223 ]
          stack made changes -
          Attachment 2707-test.txt [ 12448111 ]
          stack made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Hadoop Flags [Reviewed]
          Resolution Fixed [ 1 ]
          stack made changes -
          Attachment 2707-0.20.txt [ 12448135 ]
          stack made changes -
          Fix Version/s 0.20.6 [ 12315060 ]
          stack made changes -
          Fix Version/s 0.20.6 [ 12315060 ]
          stack made changes -
          Attachment 2707-v3.txt [ 12449602 ]
          stack made changes -
          Attachment 2707-v3.txt [ 12449602 ]

            People

            • Assignee:
              stack
              Reporter:
              Jean-Daniel Cryans
            • Votes:
              0
              Watchers:
              0
