HBASE-3304

Get spurious master fails during bootup

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.90.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      The master log says:

      2010-12-01 20:42:21,115 WARN
      org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
      Remove exception connecting to RS
      org.apache.hadoop.ipc.RemoteException:
      org.apache.hadoop.hbase.ipc.ServerNotRunningException: Server is not
      running yet
      at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1035)

      at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:753)
      at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
      at $Proxy6.getProtocolVersion(Unknown Source)
      at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:419)
      at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:393)
      at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:444)
      at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:349)
      at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:953)
      at org.apache.hadoop.hbase.catalog.CatalogTracker.getCachedConnection(CatalogTracker.java:384)
      at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForRootServerConnection(CatalogTracker.java:210)
      at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyRootRegionLocation(CatalogTracker.java:453)
      at org.apache.hadoop.hbase.master.HMaster.assignRootAndMeta(HMaster.java:421)
      at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:379)
      at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:274)
      2010-12-01 20:42:21,118 FATAL org.apache.hadoop.hbase.master.HMaster:
      Unhandled exception. Starting shutdown.
      org.apache.hadoop.hbase.ipc.ServerNotRunningException:
      org.apache.hadoop.hbase.ipc.ServerNotRunningException: Server is not
      running yet
      at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1035)

      at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
      at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
      at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
      at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:96)
      at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:959)
      at org.apache.hadoop.hbase.catalog.CatalogTracker.getCachedConnection(CatalogTracker.java:384)
      at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForRootServerConnection(CatalogTracker.java:210)
      at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyRootRegionLocation(CatalogTracker.java:453)
      at org.apache.hadoop.hbase.master.HMaster.assignRootAndMeta(HMaster.java:421)
      at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:379)
      at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:274)
      2010-12-01 20:42:21,119 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
      2010-12-01 20:42:21,119 DEBUG org.apache.hadoop.hbase.master.HMaster:
      Stopping service threads

      Then the master exits and the cluster doesn't start.

      1. 3304-v2.txt
        2 kB
        stack
      2. hbase-3304.txt
        0.5 kB
        ryan rawson

        Activity

        stack added a comment -

        OK. I applied my patch to trunk and 0.90. Let's see if it fixes Ryan's issue. Ryan, you can bonk me on the head if this fix turns out to be way wrong.

        Jean-Daniel Cryans added a comment -

        +1 on latest patch if it passes unit tests.

        stack added a comment -

        The exception occurs when we verify the root location at startup to see whether we should reassign root. When we do that, we get ServerNotRunningException, so root can't have been assigned. Therefore, catch this exception and return null to signal that root is not assigned; the assignment is then dealt with higher up.
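
The approach described in this comment can be sketched roughly as follows. This is a minimal illustration, not the committed patch: every class, interface, and method here (including the stand-in ServerNotRunningException, ConnectionFactory, and rootNeedsAssignment) is hypothetical and only mirrors the names seen in the stack trace. The assumption is that a null connection is read by the caller as "root is not assigned".

```java
import java.io.IOException;

// Hypothetical sketch; none of these types are the real HBase classes.
class RootLocationCheckSketch {

    /** Stand-in for the exception seen in the log (assumption, not HBase's class). */
    static class ServerNotRunningException extends IOException {
        ServerNotRunningException(String msg) { super(msg); }
    }

    /** Hypothetical handle to a region server connection. */
    interface RegionServerConnection {
        String serverName();
    }

    /** Hypothetical factory that may fail while the remote IPC server is still starting. */
    interface ConnectionFactory {
        RegionServerConnection connect(String address) throws IOException;
    }

    /**
     * Returns a connection to the server believed to host root, or null when
     * that server reports it is not running yet. Other IOExceptions propagate.
     */
    static RegionServerConnection getCachedConnection(ConnectionFactory factory, String address)
            throws IOException {
        try {
            return factory.connect(address);
        } catch (ServerNotRunningException e) {
            // The server cannot be serving root yet, so report "no connection".
            return null;
        }
    }

    /** Caller side: a null connection means root should be (re)assigned. */
    static boolean rootNeedsAssignment(ConnectionFactory factory, String address)
            throws IOException {
        return getCachedConnection(factory, address) == null;
    }

    public static void main(String[] args) throws IOException {
        ConnectionFactory notReady = address -> {
            throw new ServerNotRunningException("Server is not running yet");
        };
        ConnectionFactory ready = address -> () -> address;

        System.out.println(rootNeedsAssignment(notReady, "rs1:60020"));
        System.out.println(rootNeedsAssignment(ready, "rs1:60020"));
    }
}
```

Only the "not running yet" case is swallowed in this sketch; any other IOException still reaches the caller, so unrelated failures are not silently treated as an unassigned root.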

        stack added a comment -

        Now that I think on it, I think the ordering as committed was intentional. Here's where it was changed:

        ------------------------------------------------------------------------
        r1032812 | rawson | 2010-11-08 18:02:27 -0800 (Mon, 08 Nov 2010) | 3 lines
        
        HBASE-3141  Master RPC server needs to be started before an RS can check in
        

        Retry seems like the thing to add here?

        stack added a comment -

        @Ryan Was it easy to repro? And w/ this change it goes away?

        Jean-Daniel Cryans added a comment -

        I'm not at ease with this patch. What seems to happen is that the CatalogTracker is trying to talk to the old RS but its IPC server isn't started yet, and then we don't handle the thrown exception. Instead we could just retry; it seems less risky than playing in HBaseServer, although I could be totally wrong.
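
The retry alternative suggested here might look something like the sketch below. It is an assumption-laden illustration rather than anything taken from an attached patch: callWithRetry, its maxAttempts and pauseMillis parameters, and the stand-in ServerNotRunningException are all hypothetical names chosen for the example.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical sketch of a bounded retry; not the real HBase code.
class RetrySketch {

    /** Stand-in for the exception seen in the log (assumption, not HBase's class). */
    static class ServerNotRunningException extends IOException {
        ServerNotRunningException(String msg) { super(msg); }
    }

    /**
     * Invokes the call up to maxAttempts times, pausing between attempts,
     * and retries only when the remote server says it is not running yet.
     * Any other exception is rethrown immediately.
     */
    static <T> T callWithRetry(Callable<T> call, int maxAttempts, long pauseMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        ServerNotRunningException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (ServerNotRunningException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis); // give the remote IPC server time to start
                }
            }
        }
        // Every attempt failed with "not running yet"; surface the last failure.
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] attempts = {0};
        String result = callWithRetry(() -> {
            attempts[0]++;
            if (attempts[0] < 3) {
                throw new ServerNotRunningException("Server is not running yet");
            }
            return "connected";
        }, 5, 10L);
        System.out.println(result + " after " + attempts[0] + " attempts");
    }
}
```

Bounding the number of attempts keeps the master from waiting forever on a server that never comes up, at the cost of still failing if the server takes longer than the total retry window.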

        ryan rawson added a comment -

        It appears we are initializing these in the wrong order.


          People

          • Assignee:
            ryan rawson
            Reporter:
            ryan rawson