HBase
  HBASE-3660

HMaster will exit when starting with stale data in cached locations such as -ROOT- or .META.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.90.1
    • Fix Version/s: 0.90.2
    • Component/s: master, regionserver
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Later edit: I've mixed up two issues here. The main problem is that a client (which could be the HMaster) will read stale data from -ROOT- or .META. and not deal correctly with the raised exceptions.

      I noticed this when the IP on my machine changed (it's even easier to detect when LZO doesn't work).

      The Master loads .META. successfully and then starts assigning regions.
      However, LZO doesn't work, so the HRegionServer can't open the regions.
      A client then attempts to get data from a table: it reads the location from .META. but goes to a totally different server (the old value in .META.).

      This could happen without the LZO story too.
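
      The failure described above is essentially a stale-cache problem: the client keeps using a cached -ROOT-/.META. location after the hosting server's address has changed, and the connection error against the stale address is not handled as a cue to re-read the catalog. Below is a minimal, hypothetical Java sketch of the intended client-side pattern (evict the cached location on a connection failure and look it up again); the names resolve, lookupFromMeta and connect are illustrative only, not the HBase 0.90 API.

      import java.io.IOException;
      import java.net.SocketException;
      import java.net.SocketTimeoutException;
      import java.util.Map;
      import java.util.concurrent.ConcurrentHashMap;

      // Hypothetical sketch: drop a cached region location and re-read it from the
      // catalog when the cached server is unreachable, instead of failing outright.
      public class StaleLocationSketch {
          private final Map<String, String> cachedLocations = new ConcurrentHashMap<>();

          String resolve(String regionName) throws IOException {
              String server = cachedLocations.computeIfAbsent(regionName, this::lookupFromMeta);
              try {
                  connect(server);                      // may hit the old, now-unreachable address
                  return server;
              } catch (SocketTimeoutException | SocketException e) {
                  // Cached entry is stale (e.g. the host's IP changed): evict, re-read, retry once.
                  cachedLocations.remove(regionName);
                  String fresh = lookupFromMeta(regionName);
                  connect(fresh);
                  cachedLocations.put(regionName, fresh);
                  return fresh;
              }
          }

          // Placeholder for a .META./-ROOT- lookup returning the current hosting server.
          private String lookupFromMeta(String regionName) { return "10.131.171.219:60020"; }

          // Placeholder for opening an RPC connection; would throw on unreachable hosts.
          private void connect(String server) throws IOException { /* no-op in this sketch */ }
      }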

      1. HBASE-3660.patch
        1 kB
        Cosmin Lehene
      2. 3660.txt
        2 kB
        stack

        Activity

        stack added a comment -

        Confirmation that this patch fixes the above issue beyond Cosmin's thumbs-up can be found here: http://search-hadoop.com/m/lsm232yCTxf/oRouteToHostException+causes+Master+abort+when+the+RegionServer+hosting+ROOT+is+not+available%2522&subj=NoRouteToHostException+causes+Master+abort+when+the+RegionServer+hosting+ROOT+is+not+available

        Hudson added a comment -

        Integrated in HBase-TRUNK #1814 (See https://hudson.apache.org/hudson/job/HBase-TRUNK/1814/)

        stack added a comment -

        Applied to branch and trunk. Thanks for the review, Cosmin (I took a look for other places that could throw SocketException and this is what I came up with, but yeah, do the rest of the research in another issue – good on you boss)

        Cosmin Lehene added a comment -

        Hey Stack,

        Sorry for the delay. The patch looks right. Let's go with it.

        I'll try to review other instances of SocketException subclass usage and follow up. Should we open an issue for this in case I'm late with it?

        stack added a comment -

        Review please (Cosmin?)

        stack added a comment -

        @Cosmin, like this?

        Cosmin Lehene added a comment -

        Changed the issue Summary since the problem was elsewhere.

        Cosmin Lehene added a comment -

        Also there are other places where we seem to catch only SocketTimeoutException. Reviewing them might be a good idea.

        Cosmin Lehene added a comment -

        I just looked over it (it's really annoying for me as my IP changes a lot).

        It looks like we catch too narrowly in CatalogTracker.getCachedConnection (only SocketTimeoutException);
        "Host is down" or "Network unreachable" are raised as SocketException.

        2011-03-22 15:13:19,111 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
        java.net.SocketException: Host is down
        	at sun.nio.ch.Net.connect(Native Method)
        	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507)
        	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
        	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:408)
        	at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:328)
        	at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:883)
        	at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750)
        	at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
        	at $Proxy7.getProtocolVersion(Unknown Source)
        	at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:419)
        	at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:393)
        	at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:444)
        	at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:349)
        	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:953)
        	at org.apache.hadoop.hbase.catalog.CatalogTracker.getCachedConnection(CatalogTracker.java:385)
        	at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:284)
        	at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyMetaRegionLocation(CatalogTracker.java:482)
        	at org.apache.hadoop.hbase.master.HMaster.assignRootAndMeta(HMaster.java:441)
        	at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:388)
        	at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:283)
        

        I changed it to catch SocketException and don't have any problems when changing IPs anymore.
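
        A hypothetical sketch of the kind of change described here: treat SocketException ("Host is down", "Network unreachable") the same way as SocketTimeoutException, i.e. report that there is no live server at the cached location instead of letting the exception propagate and abort the master. The method shape and the Connection type below are illustrative, not the actual CatalogTracker code; see the attached patches for the real change.

        import java.io.IOException;
        import java.net.SocketException;
        import java.net.SocketTimeoutException;

        class CatalogLookupSketch {
            interface Connection { }

            // Returns null when the cached server cannot be reached, so the caller can
            // clear the stale location and wait for a fresh assignment instead of dying.
            Connection getCachedConnection(String hostAndPort) throws IOException {
                try {
                    return openConnection(hostAndPort);
                } catch (SocketTimeoutException e) {
                    return null; // previously the only case handled: the cached server timed out
                } catch (SocketException e) {
                    return null; // newly handled: "Host is down" / "Network unreachable" after an IP change
                }
            }

            // Placeholder for the real RPC proxy setup; throws on unreachable hosts.
            private Connection openConnection(String hostAndPort) throws IOException {
                return new Connection() { };
            }
        }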

        stack added a comment -

        Thanks, Cosmin, for the detail. Let's fix it for 0.90.2.

        Cosmin Lehene added a comment -

        LZO not working would indeed be a bigger problem.
        However, I mentioned LZO because it made the problem easier to spot; it isn't necessary to trigger it.

        The question is: is it OK, when a region is unavailable, for clients to contact other region servers? I was thinking this could lead to other problems. The solution I had in mind was not to remove the old server address from .META. but to mark that the region is not actually deployed.

        I'm seeing this on my laptop when I switch networks. I retested a network switch:
        Shut down everything in network A (192.168.2.0).
        Start everything (including ZK and HDFS) in network B (10.131.171.0).

        When starting HBase I get this:

        in HMaster:

        2011-03-18 11:40:38,953 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: hlog file splitting completed in 7944 ms for hdfs://localhost:9000/hbase/.logs/192.168.2.102,60020,1300389033686
        2011-03-18 11:40:58,998 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting to server: 192.168.2.102/192.168.2.102:60020
        2011-03-18 11:41:20,000 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting to server: 192.168.2.102/192.168.2.102:60020
        2011-03-18 11:41:25,163 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
        java.net.SocketException: Network is unreachable
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)

        Then it shuts down.

        In HRegionServer

        2011-03-18 11:39:24,138 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Attempting connect to Master server at 192.168.2.102:60000
        2011-03-18 11:39:44,172 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting to server: 192.168.2.102/192.168.2.102:60000
        2011-03-18 11:40:05,172 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting to server: 192.168.2.102/192.168.2.102:60000
        2011-03-18 11:40:26,174 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting to server: 192.168.2.102/192.168.2.102:60000
        2011-03-18 11:40:26,175 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to connect to master. Retrying. Error was:
        java.net.SocketTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=192.168.2.102/192.168.2.102:60000]
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:213)
        ...

        2011-03-18 11:40:29,180 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Attempting connect to Master server at 10.131.171.219:60000
        2011-03-18 11:40:29,297 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at 10.131.171.219:60000
        2011-03-18 11:40:29,300 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at 10.131.171.219:60000 that we are up
        2011-03-18 11:40:29,329 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Master passed us address to use. Was=10.131.171.219:60020, Now=10.131.171.219:60020
        2011-03-18 11:40:29,331 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: Config from master: fs.default.name=hdfs://localhost:9000/hbase

        ...

        2011-03-18 11:40:30,784 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 9 on 60020: starting
        2011-03-18 11:40:30,784 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Serving as 10.131.171.219,60020,1300441163636, RPC listening on /10.131.171.219:60020, sessionid=0x12ec85503600002
        2011-03-18 11:40:30,795 INFO org.apache.hadoop.hbase.regionserver.StoreFile: Allocating LruBlockCache with maximum size 199.2m
        2011-03-18 11:41:27,876 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: No master found, will retry

        Since HMaster is dead I start it again:

        2011-03-18 12:04:32,863 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting on regionserver(s) count to settle; currently=1
        2011-03-18 12:04:34,364 INFO org.apache.hadoop.hbase.master.ServerManager: Finished waiting for regionserver count to settle; count=1, sleptFor=4500
        2011-03-18 12:04:34,364 INFO org.apache.hadoop.hbase.master.ServerManager: Exiting wait on regionserver(s) to checkin; count=1, stopped=false, count of regions out on cluster=0
        2011-03-18 12:04:34,368 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://localhost:9000/hbase/.logs/10.131.171.219,60020,1300441163636 belongs to an existing region server
        2011-03-18 12:04:54,057 DEBUG org.apache.hadoop.hbase.client.MetaScanner: Scanning .META. starting at row= for max=2147483647 rows
        2011-03-18 12:04:54,063 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Lookedup root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@63e708b2; hsa=192.168.2.102:60020
        2011-03-18 12:04:54,390 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting to server: 192.168.2.102/192.168.2.102:60020
        2011-03-18 12:05:15,391 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting to server: 192.168.2.102/192.168.2.102:60020
        2011-03-18 12:05:36,392 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting to server: 192.168.2.102/192.168.2.102:60020
        2011-03-18 12:05:36,393 DEBUG org.apache.hadoop.hbase.catalog.CatalogTracker: Timed out connecting to 192.168.2.102:60020
        2011-03-18 12:05:36,394 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Unsetting ROOT region location in ZooKeeper
        2011-03-18 12:05:36,409 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x12ec85503600004 Creating (or updating) unassigned node for 70236052 with OFFLINE state
        2011-03-18 12:05:36,424 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for ROOT,,0.70236052 so generated a random one; hri=ROOT,,0.70236052, src=, dest=10.131.171.219,60020,1300441163636; 1 (online=1, exclude=null) available servers
        2011-03-18 12:05:36,425 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region ROOT,,0.70236052 to 10.131.171.219,60020,1300441163636
        2011-03-18 12:05:36,425 DEBUG org.apache.hadoop.hbase.master.ServerManager: New connection to 10.131.171.219,60020,1300441163636
        2011-03-18 12:05:56,395 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting to server: 192.168.2.102/192.168.2.102:60020
        2011-03-18 12:06:08,899 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: ROOT,,0.70236052 state=PENDING_OPEN, ts=1300442736425
        2011-03-18 12:06:08,901 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN for too long, reassigning region=ROOT,,0.70236052
        2011-03-18 12:06:08,901 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=ROOT,,0.70236052 state=PENDING_OPEN, ts=1300442736425
        2011-03-18 12:06:17,397 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting to server: 192.168.2.102/192.168.2.102:60020
        2011-03-18 12:06:38,399 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting to server: 192.168.2.102/192.168.2.102:60020

        ...

        2011-03-18 12:06:57,814 DEBUG org.apache.hadoop.hbase.client.MetaScanner: Scanning .META. starting at row= for max=2147483647 rows
        2011-03-18 12:06:57,817 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Lookedup root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@63e708b2; hsa=10.131.171.219:60020
        2011-03-18 12:06:58,051 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
        java.net.SocketException: Network is unreachable
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)

        The HMaster kills itself again. Stopping the region server and starting it again with the HMaster yields the same results.
        And so on. At some point, after a few restarts, it will start and work (at least until you change IPs again).

        It's not clear (to me) if the stale data is in .META. or if it could be in ZK as well.

        My point is that this is not an LZO issue.

        stack added a comment -

        But the region is not deployed, right, Cosmin? And it can't be if LZO is borked? Isn't this the bigger problem?

        The .META. will be updated on a successful region deploy? We keep the old .META. data around because we want to assign regions to their old location on restart (because of locality).


          People

          • Assignee: stack
          • Reporter: Cosmin Lehene
          • Votes: 0
          • Watchers: 1
