  1. HBase
  2. HBASE-3445

Master crashes on data that was moved from different host

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.90.0
    • Fix Version/s: 0.90.1
    • Component/s: master
    • None
    • Reviewed
    • master

    Description

      While testing an upgrade to 0.90.0 RC3, I noticed that if I seeded our test data on one machine and transferred it to another machine, the HMaster on the new machine dies on startup.

      Based on the following stack trace, it looks as though it is attempting to find the .meta region at the IP address of the original machine. Instead of waiting around for RegionServers to register with new location data, HMaster throws its hands up with a FATAL exception.

      Note that deleting the zookeeper dir makes no difference.

      Also note that so far I have only reproduced this in my own environment, using the hbase-trx extension of HBase and an ApplicationStarter that starts the Master and RegionServer together in the same JVM. While the issue is probably unrelated to those factors, this is far from a vanilla HBase environment.

      I will spend some time trying to reproduce the issue in a proper hbase test. But perhaps someone can beat me to it? How do I simulate the IP switch? May require a data.tar upload.

      [14/01/11 10:45:20] 6396 [ Thread-298] ERROR server.quorum.QuorumPeerConfig - Invalid configuration, only one server specified (ignoring)
      [14/01/11 10:45:21] 7178 [ main] INFO ion.service.HBaseRegionService - troove> region port: 60010
      [14/01/11 10:45:21] 7180 [ main] INFO ion.service.HBaseRegionService - troove> region interface: org.apache.hadoop.hbase.ipc.IndexedRegionInterface
      [14/01/11 10:45:21] 7180 [ main] INFO ion.service.HBaseRegionService - troove> root dir: hdfs://localhost:8701/hbase
      [14/01/11 10:45:21] 7180 [ main] INFO ion.service.HBaseRegionService - troove> Initializing region server.
      [14/01/11 10:45:21] 7631 [ main] INFO ion.service.HBaseRegionService - troove> Starting region server thread.
      [14/01/11 10:46:54] 100764 [ HMaster] FATAL he.hadoop.hbase.master.HMaster - Unhandled exception. Starting shutdown.
      java.net.SocketTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=192.168.1.102/192.168.1.102:60020]
      at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:213)
      at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
      at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:311)
      at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:865)
      at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:732)
      at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:258)
      at $Proxy14.getProtocolVersion(Unknown Source)
      at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:419)
      at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:393)
      at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:444)
      at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:349)
      at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:954)
      at org.apache.hadoop.hbase.catalog.CatalogTracker.getCachedConnection(CatalogTracker.java:384)
      at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:283)
      at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyMetaRegionLocation(CatalogTracker.java:478)
      at org.apache.hadoop.hbase.master.HMaster.assignRootAndMeta(HMaster.java:435)
      at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:382)
      at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:277)

      Attachments

        1. 3445-v2.txt
          6 kB
          Michael Stack
        2. 3445-refactor.txt
          13 kB
          Michael Stack
        3. 3445_0.90.0.patch
          1 kB
          James Kennedy

        Activity

          larsfrancke Lars Francke added a comment -

          This issue was closed as part of a bulk closing operation on 2015-11-20. All issues that have been resolved and where all fixVersions have been released have been closed (following discussions on the mailing list).

          hudson Hudson added a comment -

          Integrated in HBase-TRUNK #1719 (See https://hudson.apache.org/hudson/job/HBase-TRUNK/1719/)

          stack Michael Stack added a comment -

          Committed to branch and trunk. Thanks for the patch James.

          stack Michael Stack added a comment -

          Here is a test that manufactures the condition James sees. His patch fixes it (I just added DEBUG logging to his patch). I'm going to commit, though I'm not going to include my test because of HBASE-3456, "Fix hardcoding of 20 second socket timeout down in HBaseClient". I don't want to add a gratuitous 20 second wait to our test suite (not that anyone would notice the extra 20 seconds on top of an hour-plus suite).
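To illustrate why a hardcoded timeout makes such a test slow, here is a minimal sketch of a connect helper that takes the timeout as a parameter, so a test could pass a short value instead of waiting 20 seconds. This is not HBaseClient code; the class name `ConnectTimeoutSketch` and its method are hypothetical, and only the standard `java.net.Socket` API is assumed.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper, not part of HBase: the timeout is a parameter rather than a constant. */
final class ConnectTimeoutSketch {

    /**
     * Opens a TCP connection to the given address. Socket.connect throws
     * java.net.SocketTimeoutException if the timeout expires before the
     * connection is established.
     */
    static Socket connect(InetSocketAddress address, int timeoutMillis) throws IOException {
        Socket socket = new Socket();
        try {
            socket.connect(address, timeoutMillis);
            return socket;
        } catch (IOException e) {
            // Do not leak the half-open socket when the connect attempt fails.
            socket.close();
            throw e;
        }
    }
}
```

A test that wants to exercise the timeout path could then call `connect` with a few hundred milliseconds rather than the 20000 ms seen in the stack trace.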

          stack Michael Stack added a comment -

          I started in on a refactor of AM#unassign to move all of the try/catch out to a class that could be reused in places such as the CatalogTracker, around the getCachedConnection call where James ran into his issue. It turns out this is the wrong direction; the two locations differ in the exceptions they throw. I'm abandoning this tack. Attaching the patch anyway. Let me write a unit test to repro James's case to go along with his patch and see if I can generate other exceptions at the getCachedConnection juncture.

          stack Michael Stack added a comment -

          Yeah. It starts to tend in that direction, James. I think the set that is over in AM is pretty good; it's more prone to failures than the bit of code you've been massaging. Let me commit your patch.

          jk-public@troove.net James Kennedy added a comment -

          Yeah probably. I wonder if a better question is "what exceptions do we NOT want to catch so that master dies with a FATAL?"
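One way to frame that question, sketched under assumptions: list the connection failures the master could ride out and treat everything else as grounds to abort. The three exception types below are the ones caught in the AssignmentManager snippet quoted elsewhere in this thread; the class name `ExceptionPolicySketch` and the `Action` enum are hypothetical and do not exist in HBase.

```java
import java.io.EOFException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;

/** Hypothetical policy helper for illustration; not HBase code. */
final class ExceptionPolicySketch {

    /** What the caller should do after a failed RPC to a region server. */
    enum Action { WAIT_AND_RETRY, ABORT }

    /**
     * Treats connection-level failures as recoverable, on the assumption that
     * the server is gone or unreachable and new location data will arrive.
     * Anything else is considered unexpected and maps to ABORT.
     */
    static Action classify(Throwable t) {
        if (t instanceof ConnectException
                || t instanceof SocketTimeoutException
                || t instanceof EOFException) {
            return Action.WAIT_AND_RETRY;
        }
        return Action.ABORT;
    }
}
```

An allow-list like this keeps the FATAL path as the default, so a new or unforeseen exception type still stops the master rather than being silently ignored.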

          stack Michael Stack added a comment -

          Made James a contributor and assigned him this issue

          stack Michael Stack added a comment -

          Moved to 0.90.1

          stack Michael Stack added a comment -

          James:

          In the AssignmentManager, where we go to RPC to a remote regionserver, we do the following:

              } catch (ConnectException e) {
                LOG.info("Failed connect to " + server + ", message=" + e.getMessage() +
                  ", region=" + region.getEncodedName());
                // Presume that regionserver just failed and we haven't got expired
                // server from zk yet.  Let expired server deal with clean up.
              } catch (java.net.SocketTimeoutException e) {
                LOG.info("Server " + server + " returned " + e.getMessage() + " for " +
                  region.getEncodedName());
                // Presume retry or server will expire.
              } catch (EOFException e) {
                LOG.info("Server " + server + " returned " + e.getMessage() + " for " +
                  region.getEncodedName());
                // Presume retry or server will expire.
              } catch (RemoteException re) {
                IOException ioe = re.unwrapRemoteException();
                if (ioe instanceof NotServingRegionException) {
                  // Failed to close, so pass through and reassign
                  LOG.debug("Server " + server + " returned " + ioe + " for " +
                    region.getEncodedName());
                } else if (ioe instanceof EOFException) {
                  // Failed to close, so pass through and reassign
                  LOG.debug("Server " + server + " returned " + ioe + " for " +
                    region.getEncodedName());
                } else {
                  this.master.abort("Remote unexpected exception", ioe);
                }
              } catch (Throwable t) {
          

          I think your adding of timeout to the try/catch in the getCachedConnection is right. Maybe we should add the ConnectException too? Unless you object, I'll add it when I commit your patch.
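The shape of the handling discussed here could look roughly like the sketch below. It is not the committed patch and not the real CatalogTracker code: `RegionServerConnector` and `CachedConnectionSketch` are hypothetical stand-ins, and returning null to signal "no connection yet" is an assumption made for illustration.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.util.logging.Logger;

/** Hypothetical stand-in for the RPC layer; not an HBase interface. */
interface RegionServerConnector {
    Object connect(String address) throws IOException;
}

/** Illustrative only: mimics the idea of the fix, not the actual implementation. */
final class CachedConnectionSketch {
    private static final Logger LOG = Logger.getLogger("CachedConnectionSketch");

    /**
     * Returns a connection, or null when the server at a possibly stale
     * address cannot be reached. Other IOExceptions still propagate.
     */
    static Object getCachedConnection(RegionServerConnector connector, String address)
            throws IOException {
        try {
            return connector.connect(address);
        } catch (SocketTimeoutException | ConnectException e) {
            // Log instead of failing, so the caller can wait for region
            // servers to report new location data.
            LOG.warning("Unable to connect to " + address + " (" + e.getMessage()
                + "). Waiting for RegionServers to update location data.");
            return null;
        }
    }
}
```

A caller would treat a null result as "location not verified" and fall back to reassignment or waiting, while any exception outside the two caught types keeps its existing behavior.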

          jk-public@troove.net James Kennedy added a comment -

          Actually, let me qualify that last statement. By "swallow" I didn't mean to imply that the exceptions should be completely silent. In fact, some WARN output in that CatalogTracker exception handling would make sense.

          Something like:

          "Unable to connect to .meta region at 192.168.1.2:60020. Waiting for RegionServers to update location data."

          ryanobjc ryan rawson added a comment -

          Thanks for the good debugging work. I'm going to place this in 0.90.1, and someone will review it soon.

          jk-public@troove.net James Kennedy added a comment -

          Instead of wrestling with a test I did some debugging. I can fix this issue with the attached patch.
          I'll leave it up to you guys to decide if that's the right fix or if there are more exceptions to be considered, etc.

          But from my narrow scope of understanding it just seems that the CatalogTracker SHOULD swallow exceptions like SocketTimeoutException instead of throwing them up.


          People

            stack Michael Stack
            jk-public@troove.net James Kennedy