HBASE-3138

When a new master joins a running cluster but meta is yanked from it while it is processing RIT, it gets an unexpected state

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: None
    • Fix Version/s: 0.92.3
    • Component/s: None
    • Labels: None

      Description

      Testing rolling restart, I turned up the following condition.

      Master is joining an extant cluster and is trying to clean up RIT. Then the server hosting .META. is shut down in the middle of it all. Deal. Here is the exception:

      2010-10-21 06:45:58,592 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=sv2borg187,60020,1287643131919, region=efcd899283e96f20faa317772f52adca
      2010-10-21 06:45:58,616 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
      org.apache.hadoop.ipc.RemoteException: java.io.IOException: Server not running
          at org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2198)
          at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1499)
          at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:561)
          at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1025)
      
          at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:749)
          at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:255)
          at $Proxy1.get(Unknown Source)
          at org.apache.hadoop.hbase.catalog.MetaReader.getRegion(MetaReader.java:286)
          at org.apache.hadoop.hbase.master.AssignmentManager.processRegionInTransition(AssignmentManager.java:250)
          at org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:209)
          at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:392)
          at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:268)
      2010-10-21 06:45:58,617 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
      


          Activity

          Jonathan Gray added a comment -

          This is a little tricky. What should we do when we get an exception processing a RIT during failover? We could just log it and move on. If we ensure that we put the state into the in-memory RIT map as soon as possible, then even if we get an exception, we'll time it out later and we won't lose track.

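          A minimal sketch of that "log it and move on" idea, assuming a simplified stand-in for the master's in-memory regions-in-transition map; the class and method names below are hypothetical, not the actual AssignmentManager API:

          import java.io.IOException;
          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;
          import org.apache.commons.logging.Log;
          import org.apache.commons.logging.LogFactory;

          // Hypothetical sketch only, not the real AssignmentManager.
          public class FailoverRitSketch {
            private static final Log LOG = LogFactory.getLog(FailoverRitSketch.class);

            // Simplified in-memory RIT map: encoded region name -> timestamp of last transition.
            private final Map<String, Long> regionsInTransition = new ConcurrentHashMap<String, Long>();

            void processFailover(Iterable<String> ritNodes) {
              for (String encodedName : ritNodes) {
                // Record the region as in-transition before doing anything that can fail,
                // so the timeout monitor can still reassign it if we bail out below.
                regionsInTransition.put(encodedName, System.currentTimeMillis());
                try {
                  processRegionInTransition(encodedName);
                } catch (IOException e) {
                  // Log and move on instead of letting the exception escape and abort the master;
                  // the region stays in the RIT map and gets timed out and retried later.
                  LOG.warn("Failed processing RIT " + encodedName
                      + " during failover; leaving it for the timeout monitor", e);
                }
              }
            }

            private void processRegionInTransition(String encodedName) throws IOException {
              // In the real code this is roughly where .META. is consulted (MetaReader.getRegion)
              // and where the "Server not running" RemoteException in the trace above surfaced.
            }
          }

          The essential point is the ordering: the in-memory state is recorded before the fallible .META. lookup, so nothing is lost if that lookup throws.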
          Jonathan Gray added a comment -

          The issue is we don't have an HRI for the region-in-transition because we can't get to META. Without the HRI, we can't properly setup the state in RIT map.

          We could serialize the HRI into the RIT node and use it there (right now we have byte [] regionName in there).

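          A rough sketch of that serialized-HRI idea, assuming the Writable HRegionInfo and the Writables helper of the 0.90-era codebase; the znode payload handling here is illustrative, not the actual RegionTransitionData format:

          import java.io.IOException;
          import org.apache.hadoop.hbase.HRegionInfo;
          import org.apache.hadoop.hbase.util.Writables;

          // Illustrative only: carry the full serialized HRegionInfo in the RIT znode payload
          // rather than just the byte [] regionName.
          public class RitZnodePayloadSketch {

            // Writing side: serialize the HRegionInfo when creating or updating the RIT node.
            static byte[] toZnodeData(HRegionInfo hri) throws IOException {
              return Writables.getBytes(hri);
            }

            // Reading side (master failover): rebuild the HRegionInfo straight from the znode
            // data, so processRegionInTransition never needs to reach .META. to set up RIT state.
            static HRegionInfo fromZnodeData(byte[] data) throws IOException {
              return (HRegionInfo) Writables.getWritable(data, new HRegionInfo());
            }
          }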
          stack added a comment -

          Region name in RIT is close to useless. A serialized HRI would for sure be better.

          stack added a comment -

          Moving out. It's a failure of the master joining a running cluster. Can restart the master; it should join fine on the second attempt. That should be fine for 0.90.0.

          stack added a comment -

          Marking as major rather than critical. Can restart the master, which is less than ideal, but I doubt I'll get to this issue soon enough to do a better fix.

          Andrew Purtell added a comment -

          Reopen if reproducible with current shipping code.


            People

            • Assignee: Unassigned
            • Reporter: stack
            • Votes: 0
            • Watchers: 1
