HBase / HBASE-8912

[0.94] AssignmentManager throws IllegalStateException from PENDING_OPEN to OFFLINE

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.94.16
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      AM throws this exception which subsequently causes the master to abort:

      java.lang.IllegalStateException: Unexpected state : testRetrying,jjj,1372891751115.9b828792311001062a5ff4b1038fe33b. state=PENDING_OPEN, ts=1372891751912, server=hemera.apache.org,39064,1372891746132 .. Cannot transit it to OFFLINE.
      	at org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1879)
      	at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1688)
      	at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1424)
      	at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1399)
      	at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1394)
      	at org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:105)
      	at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
      	at java.lang.Thread.run(Thread.java:662)
      

      This exception trace is from the failing test TestMetaReaderEditor, which has been failing pretty frequently. Looking at the test code, I think this is not a test-only issue; it affects the main code path.

      https://builds.apache.org/job/HBase-0.94/1036/testReport/junit/org.apache.hadoop.hbase.catalog/TestMetaReaderEditor/testRetrying/

      Attachments

      1. 8912-0.94.txt
        0.6 kB
        Lars Hofhansl
      2. 8912-0.94-alt2.txt
        0.8 kB
        Lars Hofhansl
      3. 8912-fix-race.txt
        2 kB
        Lars Hofhansl
      4. HBase-0.94 #1036 test - testRetrying [Jenkins].html
        978 kB
        Enis Soztutar
      5. HBASE-8912.patch
        18 kB
        ramkrishna.s.vasudevan
      6. log.txt
        36 kB
        Jean-Marc Spaggiari
      7. org.apache.hadoop.hbase.catalog.TestMetaReaderEditor-output.txt
        1.17 MB
        Lars Hofhansl

        Activity

        Enis Soztutar added a comment -

        Attaching the logs in case jenkins deletes them.

        Lars Hofhansl added a comment -

        Any further ideas, Enis Soztutar?
        Moving to 0.94.11 for now, since this does not appear to be a new issue.

        Enis Soztutar added a comment -

        Yes, this does not seem to be new, although there was a recent user post about the same stack. I was waiting to discuss this with Jeffrey Zhong a bit more since I am not very good at AM. Our assignment manager masters are welcome to chime in.

        Vladimir Rodionov added a comment -

        This seems to be a serious issue in 0.94.x

        This is the thread in the user group:
        http://mail-archives.apache.org/mod_mbox/hbase-user/201307.mbox/%3CEE3F98CB-A4E8-4BFF-8C5F-AC50E164EB0D%40gmail.com%3E

        This is the one I started today in the dev group:
        http://mail-archives.apache.org/mod_mbox/hbase-dev/201307.mbox/%3CDC5EBE7F3610EB4CA5C7E92D76873E1517ECF1DCBD%40exchange2007.carrieriq.com%3E

        We have one cluster in QA which has had this issue several times already. The only way to fix it is to remove all data. Today I got this issue in our large dev cluster.

        Vladimir Rodionov added a comment -

        One scenario in which this java.lang.IllegalStateException: Unexpected state becomes possible:

        Try to create a table with pre-splits and an unsupported codec (in my case it was SNAPPY). For example:

        create 'usertable', { NAME=>'cf', VERSIONS=> 1, COMPRESSION => 'SNAPPY', BLOCKCACHE => true}, { SPLITS => ['user', 'user05', 'user1','user15','user2','user25','user3','user35','user4','user45','user5','user55','user6','user65','user7','user75','user8','user85','user9','user95' ]}
        

        Master aborts.
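
        For reference, here is a rough Java-client sketch of the same reproduction path using the 0.94-era client API. It only mirrors the shell command above (same table name, family, and a few of the split keys; the remaining split keys are elided), and is illustrative rather than a recommended recipe:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.HColumnDescriptor;
        import org.apache.hadoop.hbase.HTableDescriptor;
        import org.apache.hadoop.hbase.client.HBaseAdmin;
        import org.apache.hadoop.hbase.io.hfile.Compression;
        import org.apache.hadoop.hbase.util.Bytes;

        // Illustrative only: reproduces the same path as the shell command above.
        public class CreatePresplitSnappyTable {
          public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            HColumnDescriptor cf = new HColumnDescriptor("cf");
            cf.setMaxVersions(1);
            cf.setBlockCacheEnabled(true);
            // Requests SNAPPY; the regionservers in this scenario cannot load the codec.
            cf.setCompressionType(Compression.Algorithm.SNAPPY);

            HTableDescriptor table = new HTableDescriptor("usertable");
            table.addFamily(cf);

            byte[][] splits = new byte[][] {
                Bytes.toBytes("user"), Bytes.toBytes("user05"), Bytes.toBytes("user1"),
                Bytes.toBytes("user15"), Bytes.toBytes("user2")  // remaining split keys elided
            };
            // Per the comment above, every region open fails on the missing codec and the master aborts.
            admin.createTable(table, splits);
            admin.close();
          }
        }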

        Lars Hofhansl added a comment -

        No patch. Need to move on.

        Lars Hofhansl added a comment -

        What should we do with this?

        Enis Soztutar added a comment -

        I think this is still very important to fix. I can take a look at it next week or so, unless some AM expert wants to chime in and relieve me from the burden.

        ramkrishna.s.vasudevan added a comment -

        I can check it next week if you are okay with it, Enis?

        Lars Hofhansl added a comment -

        Moving to 0.94.13

        Lars Hofhansl added a comment -

        Some folks on the mailing list had something like this happening to them.

        Vladimir Rodionov added a comment -

        I posted how to reproduce this issue. https://issues.apache.org/jira/browse/HBASE-8912?focusedCommentId=13721821&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13721821
        Samir Ahmic added a comment -

        It looks like there are multiple scenarios that trigger AssignmentManager throwing IllegalStateException from PENDING_OPEN to OFFLINE. Here is my case (hbase-0.94.6.1): we had updated the configuration on two RS (5 total in the cluster) to a wrong value of hbase.client.keyvalue.maxsize (instead of -1 the value was set to 1). After restarting the cluster, the regionservers with the wrong value started throwing exceptions like this:

        2013-10-14 06:53:22,267 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
        Exception running postOpenDeployTasks; region=59badb0e2a41e7831162654227d32049
        java.lang.IllegalArgumentException: KeyValue size too large
        

        which led to:

        2013-10-14 06:53:22,271 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed third_party
        _storages,,1342430453242.59badb0e2a41e7831162654227d32049.
        

        At the same time, the AssignmentManager tried to reassign the problematic region to other regionservers and, after one more failed attempt, finally hit a server with the correct value of hbase.client.keyvalue.maxsize; here is the relevant log from that RS:

        2013-10-14 06:53:26,390 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x13e31052be39645 Attempting to transition node 59badb0e2a41e7831162654227d32049 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
        2013-10-14 06:53:26,397 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x13e31052be39645 Successfully transitioned node 59badb0e2a41e7831162654227d32049 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
        ........................
        
        2013-10-14 06:53:26,459 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x13e31052be39645 Successfully transitioned node 59badb0e2a41e7831162654227d32049 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENED
        

        At the same time the AssignmentManager threw this exception and aborted:

        2013-10-14 06:53:26,400 FATAL org.apache.hadoop.hbase.master.HMaster: Unexpected state : third_party_storages,,1342430453242.59badb0e2a41e7831162654227d32049. state=PENDING_OPEN, ts=1381748006399, server=rsdfw-10-177-161-197,60020,1381747996145.. Cannot transit it to OFFLINE.
        java.lang.IllegalStateException: Unexpected state : third_party_storages,,1342430453242.59badb0e2a41e7831162654227d32049. state=PENDING_OPEN, ts=1381748006399, server=rsdfw-10-177-161-197.internal.personal.com,60020,1381747996145 .. Cannot transit it to OFFLINE.
                at org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1820)
                at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1659)
                at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1424)
                at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1399)
                at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1394)
                at org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:105)
                at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
                at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
                at java.lang.Thread.run(Thread.java:619)
        2013-10-14 06:53:26,400 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
        

        Hope this will help.
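
        For context, the following is a minimal sketch of the kind of client-side size check involved in this scenario; it is not the actual HBase client source (only the property name hbase.client.keyvalue.maxsize is taken from the comment above, everything else is illustrative). It shows why the intended non-positive value (-1, i.e. unlimited) is harmless while a value of 1 makes essentially every write fail, which would explain the postOpenDeployTasks failures in the log above:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;

        // Sketch of the client-side check that the bad setting trips; not the real HBase source.
        public class KeyValueMaxSizeSketch {
          public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();
            conf.setInt("hbase.client.keyvalue.maxsize", 1);   // the misconfiguration described above

            int maxSize = conf.getInt("hbase.client.keyvalue.maxsize", -1);
            int keyValueLength = 128;                          // illustrative size of a single KeyValue

            // A non-positive value (such as the intended -1) disables the check entirely;
            // with maxsize=1, practically every KeyValue written by this client is rejected.
            if (maxSize > 0 && keyValueLength > maxSize) {
              throw new IllegalArgumentException("KeyValue size too large");
            }
          }
        }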

        Vladimir Rodionov added a comment -

        I think the pattern is the same. If an RS fails/aborts while the AssignmentManager is active, we can get into a situation where one or more regions are left hanging in state=PENDING_OPEN.

        Lars Hofhansl added a comment -

        Moving to 0.94.14

        Sergey Kirichenko added a comment -

        Maybe this helps (HBase from Cloudera - 0.94.6-cdh4.4.0):

        grep by the region that caused the exception, on the master:

        2013-10-31 00:07:52,871 WARN org.apache.hadoop.hbase.master.AssignmentManager: Region 3a476d37da81f620a3e53179d7d9192b has null regionLocation. But its table table_x isn't in ENABLING state.
        2013-10-31 00:07:53,057 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x242045137a20070 Async create of unassigned node for 3a476d37da81f620a3e53179d7d9192b with OFFLINE state
        2013-10-31 00:07:53,467 DEBUG org.apache.hadoop.hbase.master.AssignmentManager$CreateUnassignedAsyncCallback: rs=table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. state=OFFLINE, ts=1383163673057, server=null, server=xxx100,60020,1383163665902
        2013-10-31 00:07:53,495 DEBUG org.apache.hadoop.hbase.master.AssignmentManager$ExistsUnassignedAsyncCallback: rs=table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. state=OFFLINE, ts=1383163673057, server=null
        2013-10-31 00:07:54,834 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=xxx100,60020,1383163665902, region=3a476d37da81f620a3e53179d7d9192b
        2013-10-31 00:07:56,953 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_FAILED_OPEN, server=xxx100,60020,1383163665902, region=3a476d37da81f620a3e53179d7d9192b
        2013-10-31 00:07:56,953 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Found an existing plan for table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. destination server is xxx100,60020,1383163665902
        2013-10-31 00:07:56,953 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. so generated a random one; hri=table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b., src=, dest=xxx108,60020,1383163666006; 9 (online=9, available=8) available servers
        2013-10-31 00:07:56,955 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 3a476d37da81f620a3e53179d7d9192b
        2013-10-31 00:07:56,956 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. state=CLOSED, ts=1383163675624, server=xxx100,60020,1383163665902
        2013-10-31 00:07:56,956 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x242045137a20070 Creating (or updating) unassigned node for 3a476d37da81f620a3e53179d7d9192b with OFFLINE state
        2013-10-31 00:07:57,003 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Found an existing plan for table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. destination server is xxx108,60020,1383163666006
        2013-10-31 00:07:57,003 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b.; plan=hri=table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b., src=, dest=xxx108,60020,1383163666006
        2013-10-31 00:07:57,003 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. to xxx108,60020,1383163666006
        2013-10-31 00:07:58,545 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_FAILED_OPEN, server=xxx108,60020,1383163666006, region=3a476d37da81f620a3e53179d7d9192b
        2013-10-31 00:07:58,545 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Found an existing plan for table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. destination server is xxx108,60020,1383163666006
        2013-10-31 00:07:58,545 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. so generated a random one; hri=table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b., src=, dest=xxx106,60020,1383163666003; 9 (online=9, available=8) available servers
        2013-10-31 00:07:58,546 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 3a476d37da81f620a3e53179d7d9192b
        2013-10-31 00:07:58,546 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. state=CLOSED, ts=1383163677110, server=xxx108,60020,1383163666006
        2013-10-31 00:07:58,546 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x242045137a20070 Creating (or updating) unassigned node for 3a476d37da81f620a3e53179d7d9192b with OFFLINE state
        2013-10-31 00:07:58,553 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_FAILED_OPEN, server=xxx108,60020,1383163666006, region=3a476d37da81f620a3e53179d7d9192b
        2013-10-31 00:07:58,554 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Found an existing plan for table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. destination server is xxx106,60020,1383163666003
        2013-10-31 00:07:58,554 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. so generated a random one; hri=table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b., src=, dest=xxx104,60020,1383163665976; 9 (online=9, available=8) available servers
        2013-10-31 00:07:58,554 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 3a476d37da81f620a3e53179d7d9192b
        2013-10-31 00:07:58,554 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. state=CLOSED, ts=1383163677110, server=xxx108,60020,1383163666006
        2013-10-31 00:07:58,571 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Found an existing plan for table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. destination server is xxx104,60020,1383163665976
        2013-10-31 00:07:58,571 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b.; plan=hri=table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b., src=, dest=xxx104,60020,1383163665976
        2013-10-31 00:07:58,571 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. to xxx104,60020,1383163665976
        2013-10-31 00:07:58,595 FATAL org.apache.hadoop.hbase.master.HMaster: Unexpected state : table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. state=PENDING_OPEN, ts=1383163678594, server=xxx104,60020,1383163665976 .. Cannot transit it to OFFLINE.
        java.lang.IllegalStateException: Unexpected state : table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. state=PENDING_OPEN, ts=1383163678594, server=xxx104,60020,1383163665976 .. Cannot transit it to OFFLINE.
                at org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1831)
                at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1661)
                at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1426)
                at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1398)
                at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1393)
                at org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:105)
                at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
                at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
                at java.lang.Thread.run(Thread.java:662)
        

        grep by the region that caused the exception, on xxx100:

        2013-10-31 00:07:54,000 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b.
        2013-10-31 00:07:54,000 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x242045137a20071 Attempting to transition node 3a476d37da81f620a3e53179d7d9192b from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
        2013-10-31 00:07:54,029 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x242045137a20071 Successfully transitioned node 3a476d37da81f620a3e53179d7d9192b from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
        2013-10-31 00:07:55,439 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Opening region: {NAME => 'table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b.', STARTKEY => '6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8', ENDKEY => '71c71c71c71c71c71c71c71c71c71c71c71c71c0', ENCODED => 3a476d37da81f620a3e53179d7d9192b,}
        2013-10-31 00:07:55,439 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Instantiated table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b.
        2013-10-31 00:07:55,447 DEBUG org.apache.hadoop.hbase.regionserver.StoreFile: Store file hdfs://xxx-master/hbase/table_x/3a476d37da81f620a3e53179d7d9192b/A/table_y=abca20169d0fdd22f2a13d6caf41d83d-0729d4dbd652443bbd44381db5b7b26b is a link
        2013-10-31 00:07:55,501 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded hdfs://xxx-master/hbase/table_x/3a476d37da81f620a3e53179d7d9192b/A/table_y=abca20169d0fdd22f2a13d6caf41d83d-0729d4dbd652443bbd44381db5b7b26b, isReference=false, isBulkLoadResult=false, seqid=46816, majorCompaction=true
        2013-10-31 00:07:55,546 DEBUG org.apache.hadoop.hbase.regionserver.StoreFile: Store file hdfs://xxx-master/hbase/table_x/3a476d37da81f620a3e53179d7d9192b/B/table_y=abca20169d0fdd22f2a13d6caf41d83d-8606885898507153833 is a link
        2013-10-31 00:07:55,602 DEBUG org.apache.hadoop.hbase.regionserver.StoreFile: Store file hdfs://xxx-master/hbase/table_x/3a476d37da81f620a3e53179d7d9192b/B/table_y=abca20169d0fdd22f2a13d6caf41d83d-b2449476f78e42b7ba0bba2ac69a24b8 is a link
        2013-10-31 00:07:55,613 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded hdfs://xxx-master/hbase/table_x/3a476d37da81f620a3e53179d7d9192b/B/table_y=abca20169d0fdd22f2a13d6caf41d83d-b2449476f78e42b7ba0bba2ac69a24b8, isReference=false, isBulkLoadResult=false, seqid=46816, majorCompaction=false
        2013-10-31 00:07:55,618 ERROR org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of region=table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b., starting to roll back the global memstore size.
        2013-10-31 00:07:55,621 INFO org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Opening of region {NAME => 'table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b.', STARTKEY => '6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8', ENDKEY => '71c71c71c71c71c71c71c71c71c71c71c71c71c0', ENCODED => 3a476d37da81f620a3e53179d7d9192b,} failed, marking as FAILED_OPEN in ZK
        2013-10-31 00:07:55,621 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x242045137a20071 Attempting to transition node 3a476d37da81f620a3e53179d7d9192b from RS_ZK_REGION_OPENING to RS_ZK_REGION_FAILED_OPEN
        2013-10-31 00:07:55,630 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x242045137a20071 Successfully transitioned node 3a476d37da81f620a3e53179d7d9192b from RS_ZK_REGION_OPENING to RS_ZK_REGION_FAILED_OPEN
        

        grep by the region that caused the exception, on xxx108:

        2013-10-31 00:07:57,003 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b.
        2013-10-31 00:07:57,010 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x242045137a20074 Attempting to transition node 3a476d37da81f620a3e53179d7d9192b from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
        2013-10-31 00:07:57,042 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x242045137a20074 Successfully transitioned node 3a476d37da81f620a3e53179d7d9192b from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
        2013-10-31 00:07:57,043 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Opening region: {NAME => 'table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b.', STARTKEY => '6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8', ENDKEY => '71c71c71c71c71c71c71c71c71c71c71c71c71c0', ENCODED => 3a476d37da81f620a3e53179d7d9192b,}
        2013-10-31 00:07:57,043 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Instantiated table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b.
        2013-10-31 00:07:57,049 DEBUG org.apache.hadoop.hbase.regionserver.StoreFile: Store file hdfs://xxx-master/hbase/table_x/3a476d37da81f620a3e53179d7d9192b/A/table_y=abca20169d0fdd22f2a13d6caf41d83d-0729d4dbd652443bbd44381db5b7b26b is a link
        2013-10-31 00:07:57,060 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded hdfs://xxx-master/hbase/table_x/3a476d37da81f620a3e53179d7d9192b/A/table_y=abca20169d0fdd22f2a13d6caf41d83d-0729d4dbd652443bbd44381db5b7b26b, isReference=false, isBulkLoadResult=false, seqid=46816, majorCompaction=true
        2013-10-31 00:07:57,065 DEBUG org.apache.hadoop.hbase.regionserver.StoreFile: Store file hdfs://xxx-master/hbase/table_x/3a476d37da81f620a3e53179d7d9192b/B/table_y=abca20169d0fdd22f2a13d6caf41d83d-8606885898507153833 is a link
        2013-10-31 00:07:57,095 DEBUG org.apache.hadoop.hbase.regionserver.StoreFile: Store file hdfs://xxx-master/hbase/table_x/3a476d37da81f620a3e53179d7d9192b/B/table_y=abca20169d0fdd22f2a13d6caf41d83d-b2449476f78e42b7ba0bba2ac69a24b8 is a link
        2013-10-31 00:07:57,105 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded hdfs://xxx-master/hbase/table_x/3a476d37da81f620a3e53179d7d9192b/B/table_y=abca20169d0fdd22f2a13d6caf41d83d-b2449476f78e42b7ba0bba2ac69a24b8, isReference=false, isBulkLoadResult=false, seqid=46816, majorCompaction=false
        2013-10-31 00:07:57,107 ERROR org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of region=table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b., starting to roll back the global memstore size.
        2013-10-31 00:07:57,108 INFO org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Opening of region {NAME => 'table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b.', STARTKEY => '6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8', ENDKEY => '71c71c71c71c71c71c71c71c71c71c71c71c71c0', ENCODED => 3a476d37da81f620a3e53179d7d9192b,} failed, marking as FAILED_OPEN in ZK
        2013-10-31 00:07:57,108 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x242045137a20074 Attempting to transition node 3a476d37da81f620a3e53179d7d9192b from RS_ZK_REGION_OPENING to RS_ZK_REGION_FAILED_OPEN
        2013-10-31 00:07:57,125 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x242045137a20074 Successfully transitioned node 3a476d37da81f620a3e53179d7d9192b from RS_ZK_REGION_OPENING to RS_ZK_REGION_FAILED_OPEN
        

        grep by the region that caused the exception, on xxx104:

        2013-10-31 00:07:58,581 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b.
        2013-10-31 00:07:58,587 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x420451326a0070 Attempting to transition node 3a476d37da81f620a3e53179d7d9192b from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
        2013-10-31 00:07:58,602 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x420451326a0070 Successfully transitioned node 3a476d37da81f620a3e53179d7d9192b from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
        2013-10-31 00:07:58,603 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Opening region: {NAME => 'table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b.', STARTKEY => '6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8', ENDKEY => '71c71c71c71c71c71c71c71c71c71c71c71c71c0', ENCODED => 3a476d37da81f620a3e53179d7d9192b,}
        2013-10-31 00:07:58,604 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Instantiated table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b.
        2013-10-31 00:07:58,610 DEBUG org.apache.hadoop.hbase.regionserver.StoreFile: Store file hdfs://xxx-master/hbase/table_x/3a476d37da81f620a3e53179d7d9192b/A/table_y=abca20169d0fdd22f2a13d6caf41d83d-0729d4dbd652443bbd44381db5b7b26b is a link
        2013-10-31 00:07:58,621 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded hdfs://xxx-master/hbase/table_x/3a476d37da81f620a3e53179d7d9192b/A/table_y=abca20169d0fdd22f2a13d6caf41d83d-0729d4dbd652443bbd44381db5b7b26b, isReference=false, isBulkLoadResult=false, seqid=46816, majorCompaction=true
        2013-10-31 00:07:58,627 DEBUG org.apache.hadoop.hbase.regionserver.StoreFile: Store file hdfs://xxx-master/hbase/table_x/3a476d37da81f620a3e53179d7d9192b/B/table_y=abca20169d0fdd22f2a13d6caf41d83d-8606885898507153833 is a link
        2013-10-31 00:07:58,639 DEBUG org.apache.hadoop.hbase.regionserver.StoreFile: Store file hdfs://xxx-master/hbase/table_x/3a476d37da81f620a3e53179d7d9192b/B/table_y=abca20169d0fdd22f2a13d6caf41d83d-b2449476f78e42b7ba0bba2ac69a24b8 is a link
        2013-10-31 00:07:58,650 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded hdfs://xxx-master/hbase/table_x/3a476d37da81f620a3e53179d7d9192b/B/table_y=abca20169d0fdd22f2a13d6caf41d83d-b2449476f78e42b7ba0bba2ac69a24b8, isReference=false, isBulkLoadResult=false, seqid=46816, majorCompaction=false
        2013-10-31 00:07:58,652 ERROR org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of region=table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b., starting to roll back the global memstore size.
        2013-10-31 00:07:58,653 INFO org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Opening of region {NAME => 'table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b.', STARTKEY => '6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8', ENDKEY => '71c71c71c71c71c71c71c71c71c71c71c71c71c0', ENCODED => 3a476d37da81f620a3e53179d7d9192b,} failed, marking as FAILED_OPEN in ZK
        2013-10-31 00:07:58,653 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x420451326a0070 Attempting to transition node 3a476d37da81f620a3e53179d7d9192b from RS_ZK_REGION_OPENING to RS_ZK_REGION_FAILED_OPEN
        2013-10-31 00:07:58,670 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x420451326a0070 Successfully transitioned node 3a476d37da81f620a3e53179d7d9192b from RS_ZK_REGION_OPENING to RS_ZK_REGION_FAILED_OPEN
        

        1) Initially the AM tries to assign the region with 'bulk assign' on xxx100 (PENDING_OPEN => RS_ZK_REGION_OPENING); but xxx100 fails to open the region and the AM handles this event (RS_ZK_REGION_FAILED_OPEN => CLOSED => OFFLINE)
        2) The AM tries to assign the region in ClosedRegionHandler on xxx108 (there is no RS_ZK_REGION_OPENING event in the master's logs, but we see it in the regionserver's logs); it fails again
        3) The AM chooses xxx106 for the region assignment but receives RS_ZK_REGION_FAILED_OPEN before sending the request => CLOSED => ClosedRegionHandler => xxx104 => exception (a toy sketch of this interleaving follows below)
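
        The following toy sketch models that interleaving. None of these class, enum, or method names are HBase internals; it is a simplified stand-in for the AssignmentManager/ClosedRegionHandler interaction in which a first event leaves the in-memory state PENDING_OPEN, and a stale CLOSED-event handler that then tries to force the region OFFLINE hits the same "Cannot transit it to OFFLINE" check:

        import java.util.concurrent.ExecutionException;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import java.util.concurrent.Future;
        import java.util.concurrent.atomic.AtomicReference;

        // Toy model of the interleaving above; none of these names are HBase internals.
        public class PendingOpenRaceSketch {
          enum State { CLOSED, OFFLINE, PENDING_OPEN }

          static final AtomicReference<State> regionState = new AtomicReference<>(State.CLOSED);

          // Stand-in for the master's "force OFFLINE before assigning" step.
          static void forceOffline() {
            State s = regionState.get();
            if (s != State.CLOSED && s != State.OFFLINE) {
              throw new IllegalStateException("Unexpected state : " + s + " .. Cannot transit it to OFFLINE.");
            }
            regionState.set(State.OFFLINE);
          }

          // Stand-in for assign(): force OFFLINE, then mark the region PENDING_OPEN.
          static void assign() {
            forceOffline();
            regionState.set(State.PENDING_OPEN);  // the open request to the chosen server follows
          }

          public static void main(String[] args) throws Exception {
            ExecutorService eventHandlers = Executors.newFixedThreadPool(2);

            // Event 1: FAILED_OPEN -> forced CLOSED -> ClosedRegionHandler runs assign(),
            // leaving the region PENDING_OPEN for the next server.
            eventHandlers.submit(PendingOpenRaceSketch::assign).get();

            // Event 2: a stale FAILED_OPEN/CLOSED event for the same region is handled while
            // the region is already PENDING_OPEN; its assign() trips the state check,
            // which is what aborts the master in the logs above.
            Future<?> stale = eventHandlers.submit(PendingOpenRaceSketch::assign);
            try {
              stale.get();
            } catch (ExecutionException e) {
              System.out.println("Master would abort here: " + e.getCause());
            }
            eventHandlers.shutdown();
          }
        }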

        Show
        Sergey Kirichenko added a comment - May be this helps (HBase from cloudera - 0.94.6-cdh4.4.0): grep by region caused exception on master: 2013-10-31 00:07:52,871 WARN org.apache.hadoop.hbase.master.AssignmentManager: Region 3a476d37da81f620a3e53179d7d9192b has null regionLocation. But its table table_x isn't in ENABLING state. 2013-10-31 00:07:53,057 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x242045137a20070 Async create of unassigned node for 3a476d37da81f620a3e53179d7d9192b with OFFLINE state 2013-10-31 00:07:53,467 DEBUG org.apache.hadoop.hbase.master.AssignmentManager$CreateUnassignedAsyncCallback: rs=table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. state=OFFLINE, ts=1383163673057, server=null, server=xxx100,60020,1383163665902 2013-10-31 00:07:53,495 DEBUG org.apache.hadoop.hbase.master.AssignmentManager$ExistsUnassignedAsyncCallback: rs=table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. state=OFFLINE, ts=1383163673057, server=null 2013-10-31 00:07:54,834 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=xxx100,60020,1383163665902, region=3a476d37da81f620a3e53179d7d9192b 2013-10-31 00:07:56,953 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_FAILED_OPEN, server=xxx100,60020,1383163665902, region=3a476d37da81f620a3e53179d7d9192b 2013-10-31 00:07:56,953 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Found an existing plan for table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. destination server is xxx100,60020,1383163665902 2013-10-31 00:07:56,953 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. so generated a random one; hri=table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b., src=, dest=xxx108,60020,1383163666006; 9 (online=9, available=8) available servers 2013-10-31 00:07:56,955 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 3a476d37da81f620a3e53179d7d9192b 2013-10-31 00:07:56,956 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. state=CLOSED, ts=1383163675624, server=xxx100,60020,1383163665902 2013-10-31 00:07:56,956 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x242045137a20070 Creating (or updating) unassigned node for 3a476d37da81f620a3e53179d7d9192b with OFFLINE state 2013-10-31 00:07:57,003 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Found an existing plan for table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. 
destination server is xxx108,60020,1383163666006 2013-10-31 00:07:57,003 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b.; plan=hri=table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b., src=, dest=xxx108,60020,1383163666006 2013-10-31 00:07:57,003 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. to xxx108,60020,1383163666006 2013-10-31 00:07:58,545 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_FAILED_OPEN, server=xxx108,60020,1383163666006, region=3a476d37da81f620a3e53179d7d9192b 2013-10-31 00:07:58,545 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Found an existing plan for table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. destination server is xxx108,60020,1383163666006 2013-10-31 00:07:58,545 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. so generated a random one; hri=table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b., src=, dest=xxx106,60020,1383163666003; 9 (online=9, available=8) available servers 2013-10-31 00:07:58,546 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 3a476d37da81f620a3e53179d7d9192b 2013-10-31 00:07:58,546 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. state=CLOSED, ts=1383163677110, server=xxx108,60020,1383163666006 2013-10-31 00:07:58,546 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x242045137a20070 Creating (or updating) unassigned node for 3a476d37da81f620a3e53179d7d9192b with OFFLINE state 2013-10-31 00:07:58,553 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_FAILED_OPEN, server=xxx108,60020,1383163666006, region=3a476d37da81f620a3e53179d7d9192b 2013-10-31 00:07:58,554 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Found an existing plan for table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. destination server is xxx106,60020,1383163666003 2013-10-31 00:07:58,554 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. so generated a random one; hri=table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b., src=, dest=xxx104,60020,1383163665976; 9 (online=9, available=8) available servers 2013-10-31 00:07:58,554 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 3a476d37da81f620a3e53179d7d9192b 2013-10-31 00:07:58,554 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. 
state=CLOSED, ts=1383163677110, server=xxx108,60020,1383163666006 2013-10-31 00:07:58,571 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Found an existing plan for table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. destination server is xxx104,60020,1383163665976 2013-10-31 00:07:58,571 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b.; plan=hri=table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b., src=, dest=xxx104,60020,1383163665976 2013-10-31 00:07:58,571 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. to xxx104,60020,1383163665976 2013-10-31 00:07:58,595 FATAL org.apache.hadoop.hbase.master.HMaster: Unexpected state : table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. state=PENDING_OPEN, ts=1383163678594, server=xxx104,60020,1383163665976 .. Cannot transit it to OFFLINE. java.lang.IllegalStateException: Unexpected state : table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. state=PENDING_OPEN, ts=1383163678594, server=xxx104,60020,1383163665976 .. Cannot transit it to OFFLINE. at org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1831) at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1661) at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1426) at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1398) at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1393) at org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:105) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) grep by region caused exception on xxx100: 2013-10-31 00:07:54,000 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. 2013-10-31 00:07:54,000 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x242045137a20071 Attempting to transition node 3a476d37da81f620a3e53179d7d9192b from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING 2013-10-31 00:07:54,029 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x242045137a20071 Successfully transitioned node 3a476d37da81f620a3e53179d7d9192b from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING 2013-10-31 00:07:55,439 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Opening region: {NAME => 'table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b.', STARTKEY => '6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8', ENDKEY => '71c71c71c71c71c71c71c71c71c71c71c71c71c0', ENCODED => 3a476d37da81f620a3e53179d7d9192b,} 2013-10-31 00:07:55,439 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Instantiated table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. 
2013-10-31 00:07:55,447 DEBUG org.apache.hadoop.hbase.regionserver.StoreFile: Store file hdfs://xxx-master/hbase/table_x/3a476d37da81f620a3e53179d7d9192b/A/table_y=abca20169d0fdd22f2a13d6caf41d83d-0729d4dbd652443bbd44381db5b7b26b is a link 2013-10-31 00:07:55,501 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded hdfs://xxx-master/hbase/table_x/3a476d37da81f620a3e53179d7d9192b/A/table_y=abca20169d0fdd22f2a13d6caf41d83d-0729d4dbd652443bbd44381db5b7b26b, isReference=false, isBulkLoadResult=false, seqid=46816, majorCompaction=true 2013-10-31 00:07:55,546 DEBUG org.apache.hadoop.hbase.regionserver.StoreFile: Store file hdfs://xxx-master/hbase/table_x/3a476d37da81f620a3e53179d7d9192b/B/table_y=abca20169d0fdd22f2a13d6caf41d83d-8606885898507153833 is a link 2013-10-31 00:07:55,602 DEBUG org.apache.hadoop.hbase.regionserver.StoreFile: Store file hdfs://xxx-master/hbase/table_x/3a476d37da81f620a3e53179d7d9192b/B/table_y=abca20169d0fdd22f2a13d6caf41d83d-b2449476f78e42b7ba0bba2ac69a24b8 is a link 2013-10-31 00:07:55,613 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded hdfs://xxx-master/hbase/table_x/3a476d37da81f620a3e53179d7d9192b/B/table_y=abca20169d0fdd22f2a13d6caf41d83d-b2449476f78e42b7ba0bba2ac69a24b8, isReference=false, isBulkLoadResult=false, seqid=46816, majorCompaction=false 2013-10-31 00:07:55,618 ERROR org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of region=table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b., starting to roll back the global memstore size. 2013-10-31 00:07:55,621 INFO org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Opening of region {NAME => 'table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b.', STARTKEY => '6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8', ENDKEY => '71c71c71c71c71c71c71c71c71c71c71c71c71c0', ENCODED => 3a476d37da81f620a3e53179d7d9192b,} failed, marking as FAILED_OPEN in ZK 2013-10-31 00:07:55,621 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x242045137a20071 Attempting to transition node 3a476d37da81f620a3e53179d7d9192b from RS_ZK_REGION_OPENING to RS_ZK_REGION_FAILED_OPEN 2013-10-31 00:07:55,630 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x242045137a20071 Successfully transitioned node 3a476d37da81f620a3e53179d7d9192b from RS_ZK_REGION_OPENING to RS_ZK_REGION_FAILED_OPEN grep by region caused exception on xxx108: 2013-10-31 00:07:57,003 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. 
2013-10-31 00:07:57,010 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x242045137a20074 Attempting to transition node 3a476d37da81f620a3e53179d7d9192b from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING 2013-10-31 00:07:57,042 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x242045137a20074 Successfully transitioned node 3a476d37da81f620a3e53179d7d9192b from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING 2013-10-31 00:07:57,043 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Opening region: {NAME => 'table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b.', STARTKEY => '6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8', ENDKEY => '71c71c71c71c71c71c71c71c71c71c71c71c71c0', ENCODED => 3a476d37da81f620a3e53179d7d9192b,} 2013-10-31 00:07:57,043 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Instantiated table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. 2013-10-31 00:07:57,049 DEBUG org.apache.hadoop.hbase.regionserver.StoreFile: Store file hdfs://xxx-master/hbase/table_x/3a476d37da81f620a3e53179d7d9192b/A/table_y=abca20169d0fdd22f2a13d6caf41d83d-0729d4dbd652443bbd44381db5b7b26b is a link 2013-10-31 00:07:57,060 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded hdfs://xxx-master/hbase/table_x/3a476d37da81f620a3e53179d7d9192b/A/table_y=abca20169d0fdd22f2a13d6caf41d83d-0729d4dbd652443bbd44381db5b7b26b, isReference=false, isBulkLoadResult=false, seqid=46816, majorCompaction=true 2013-10-31 00:07:57,065 DEBUG org.apache.hadoop.hbase.regionserver.StoreFile: Store file hdfs://xxx-master/hbase/table_x/3a476d37da81f620a3e53179d7d9192b/B/table_y=abca20169d0fdd22f2a13d6caf41d83d-8606885898507153833 is a link 2013-10-31 00:07:57,095 DEBUG org.apache.hadoop.hbase.regionserver.StoreFile: Store file hdfs://xxx-master/hbase/table_x/3a476d37da81f620a3e53179d7d9192b/B/table_y=abca20169d0fdd22f2a13d6caf41d83d-b2449476f78e42b7ba0bba2ac69a24b8 is a link 2013-10-31 00:07:57,105 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded hdfs://xxx-master/hbase/table_x/3a476d37da81f620a3e53179d7d9192b/B/table_y=abca20169d0fdd22f2a13d6caf41d83d-b2449476f78e42b7ba0bba2ac69a24b8, isReference=false, isBulkLoadResult=false, seqid=46816, majorCompaction=false 2013-10-31 00:07:57,107 ERROR org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of region=table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b., starting to roll back the global memstore size. 
2013-10-31 00:07:57,108 INFO org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Opening of region {NAME => 'table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b.', STARTKEY => '6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8', ENDKEY => '71c71c71c71c71c71c71c71c71c71c71c71c71c0', ENCODED => 3a476d37da81f620a3e53179d7d9192b,} failed, marking as FAILED_OPEN in ZK 2013-10-31 00:07:57,108 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x242045137a20074 Attempting to transition node 3a476d37da81f620a3e53179d7d9192b from RS_ZK_REGION_OPENING to RS_ZK_REGION_FAILED_OPEN 2013-10-31 00:07:57,125 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x242045137a20074 Successfully transitioned node 3a476d37da81f620a3e53179d7d9192b from RS_ZK_REGION_OPENING to RS_ZK_REGION_FAILED_OPEN grep by region caused exception on xxx104: 2013-10-31 00:07:58,581 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. 2013-10-31 00:07:58,587 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x420451326a0070 Attempting to transition node 3a476d37da81f620a3e53179d7d9192b from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING 2013-10-31 00:07:58,602 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x420451326a0070 Successfully transitioned node 3a476d37da81f620a3e53179d7d9192b from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING 2013-10-31 00:07:58,603 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Opening region: {NAME => 'table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b.', STARTKEY => '6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8', ENDKEY => '71c71c71c71c71c71c71c71c71c71c71c71c71c0', ENCODED => 3a476d37da81f620a3e53179d7d9192b,} 2013-10-31 00:07:58,604 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Instantiated table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b. 
2013-10-31 00:07:58,610 DEBUG org.apache.hadoop.hbase.regionserver.StoreFile: Store file hdfs://xxx-master/hbase/table_x/3a476d37da81f620a3e53179d7d9192b/A/table_y=abca20169d0fdd22f2a13d6caf41d83d-0729d4dbd652443bbd44381db5b7b26b is a link 2013-10-31 00:07:58,621 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded hdfs://xxx-master/hbase/table_x/3a476d37da81f620a3e53179d7d9192b/A/table_y=abca20169d0fdd22f2a13d6caf41d83d-0729d4dbd652443bbd44381db5b7b26b, isReference=false, isBulkLoadResult=false, seqid=46816, majorCompaction=true 2013-10-31 00:07:58,627 DEBUG org.apache.hadoop.hbase.regionserver.StoreFile: Store file hdfs://xxx-master/hbase/table_x/3a476d37da81f620a3e53179d7d9192b/B/table_y=abca20169d0fdd22f2a13d6caf41d83d-8606885898507153833 is a link 2013-10-31 00:07:58,639 DEBUG org.apache.hadoop.hbase.regionserver.StoreFile: Store file hdfs://xxx-master/hbase/table_x/3a476d37da81f620a3e53179d7d9192b/B/table_y=abca20169d0fdd22f2a13d6caf41d83d-b2449476f78e42b7ba0bba2ac69a24b8 is a link 2013-10-31 00:07:58,650 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded hdfs://xxx-master/hbase/table_x/3a476d37da81f620a3e53179d7d9192b/B/table_y=abca20169d0fdd22f2a13d6caf41d83d-b2449476f78e42b7ba0bba2ac69a24b8, isReference=false, isBulkLoadResult=false, seqid=46816, majorCompaction=false 2013-10-31 00:07:58,652 ERROR org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of region=table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b., starting to roll back the global memstore size. 2013-10-31 00:07:58,653 INFO org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Opening of region {NAME => 'table_x,6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8,1382557475861.3a476d37da81f620a3e53179d7d9192b.', STARTKEY => '6eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee8', ENDKEY => '71c71c71c71c71c71c71c71c71c71c71c71c71c0', ENCODED => 3a476d37da81f620a3e53179d7d9192b,} failed, marking as FAILED_OPEN in ZK 2013-10-31 00:07:58,653 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x420451326a0070 Attempting to transition node 3a476d37da81f620a3e53179d7d9192b from RS_ZK_REGION_OPENING to RS_ZK_REGION_FAILED_OPEN 2013-10-31 00:07:58,670 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x420451326a0070 Successfully transitioned node 3a476d37da81f620a3e53179d7d9192b from RS_ZK_REGION_OPENING to RS_ZK_REGION_FAILED_OPEN 1) Initially AM try to assign region with 'bulk assign' on xxx100 (PENDING_OPEN => RS_ZK_REGION_OPENING); but xxx100 failed to open region and AM handles this event (RS_ZK_REGION_FAILED_OPEN => CLOSED => OFFLINE) 2) AM try to assign region in ClosedRegionHandler on xxx108 (there is no RS_ZK_REGION_OPENING event in master's logs, but we see it in regionserver's logs); it fails again 3) AM chose xxx106 for region assignment but receives RS_ZK_REGION_FAILED_OPEN before sending request => CLOSED => ClosedRegionHandler => xxx104 => exception
        Lars Hofhansl added a comment -

        Still no patch. Maybe I'll get to this before 0.94.14, but until then moving to 0.94.15.

        chenjiajun added a comment -

        2013-11-27 18:26:18,102 FATAL org.apache.hadoop.hbase.master.HMaster: Master server abort: loaded coprocessors are: []
        2013-11-27 18:26:18,102 FATAL org.apache.hadoop.hbase.master.HMaster: Unexpected state : H,http://istock.jrj.com.cn/article,002024,6567377.html,1385541132079.18c9cb11b3e673dec07038f166fb3ef7. state=PENDING_O
        PEN, ts=1385547978102, server=d199.uuc.com,60020,1385047501649 .. Cannot transit it to OFFLINE.
        java.lang.IllegalStateException: Unexpected state : H,http://istock.jrj.com.cn/article,002024,6567377.html,1385541132079.18c9cb11b3e673dec07038f166fb3ef7. state=PENDING_OPEN, ts=1385547978102, server=d199.uuc.com,60020,1385047501649 .. Cannot transit it to OFFLINE.
        at org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1890)
        at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1690)
        at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1426)
        at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1398)
        at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1393)
        at org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:105)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
        2013-11-27 18:26:18,104 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
        2013-11-27 18:26:18,104 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60000
        2013-11-27 18:26:18,104 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 1 on 60000: exiting
        ......

        RS 's log:
        2013-11-27 18:24:33,375 INFO org.apache.hadoop.hbase.regionserver.StoreFile: Delete Family Bloom filter type for hdfs://master.uc.uuc.com:9000/hbase/H/18c9cb11b3e673dec07038f166fb3ef7/.tmp/832ec249071c45b3934a186046ca429d: CompoundBloomFilterWriter
        2013-11-27 18:24:33,385 INFO org.apache.hadoop.hbase.regionserver.StoreFile: NO General Bloom and NO DeleteFamily was added to HFile (hdfs://master.uc.uuc.com:9000/hbase/H/18c9cb11b3e673dec07038f166fb3ef7/.tmp/832ec249071c45b3934a186046ca429d)
        2013-11-27 18:24:33,388 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming compacted file at hdfs://master.uc.uuc.com:9000/hbase/H/18c9cb11b3e673dec07038f166fb3ef7/.tmp/832ec249071c45b3934a186046ca429d to hdfs://master.uc.uuc.com:9000/hbase/H/18c9cb11b3e673dec07038f166fb3ef7/page/832ec249071c45b3934a186046ca429d
        2013-11-27 18:24:33,421 INFO org.apache.hadoop.hbase.regionserver.Store: Completed compaction of 3 file(s) in page of H,http://istock.jrj.com.cn/article,002024,6567377.html,1385541132079.18c9cb11b3e673dec07038f166fb3ef7. into 832ec249071c45b3934a186046ca429d, size=85.7k; total size for store is 533.2m
        2013-11-27 18:24:33,422 INFO org.apache.hadoop.hbase.regionserver.compactions.CompactionRequest: completed compaction: regionName=H,http://istock.jrj.com.cn/article,002024,6567377.html,1385541132079.18c9cb11b3e673dec07038f166fb3ef7., storeName=page, fileCount=3, fileSize=91.8k, priority=3, time=1230511587222436; duration=0sec
        2013-11-27 18:24:37,189 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region DocSortByHour,\x00\x00\x011\xAE\xCB\xD8 \x00\x00,B\x00\x00\x05\x84\x00\x01\xE6\xAF\x00\x02\x09\x0D\x00\x00\x00\x00\x06\xB5c\xB3,1314093149893.c5e2ce4602fb62714c72cd8da3b50bd5. after a delay of 5137
        2013-11-27 18:24:43,658 INFO org.apache.hadoop.hbase.util.FSUtils: FileSystem doesn't support getDefaultReplication
        2013-11-27 18:24:43,658 INFO org.apache.hadoop.hbase.util.FSUtils: FileSystem doesn't support getDefaultBlockSize
        2013-11-27 18:24:43,681 INFO org.apache.hadoop.hbase.regionserver.StoreFile: Delete Family Bloom filter type for hdfs://master.uc.uuc.com:9000/hbase/DocSortByHour/c5e2ce4602fb62714c72cd8da3b50bd5/.tmp/dd7c3908111d4fcc8945b2b6d5291364: CompoundBloomFilterWriter
        2013-11-27 18:24:43,691 INFO org.apache.hadoop.hbase.regionserver.StoreFile: NO General Bloom and NO DeleteFamily was added to HFile (hdfs://master.uc.uuc.com:9000/hbase/DocSortByHour/c5e2ce4602fb62714c72cd8da3b50bd5/.tmp/dd7c3908111d4fcc8945b2b6d5291364)
        2013-11-27 18:24:43,691 INFO org.apache.hadoop.hbase.regionserver.Store: Flushed , sequenceid=26736748922, memsize=832.0, into tmp file hdfs://master.uc.uuc.com:9000/hbase/DocSortByHour/c5e2ce4602fb62714c72cd8da3b50bd5/.tmp/dd7c3908111d4fcc8945b2b6d5291364
        2013-11-27 18:24:43,699 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://master.uc.uuc.com:9000/hbase/DocSortByHour/c5e2ce4602fb62714c72cd8da3b50bd5/data/dd7c3908111d4fcc8945b2b6d5291364, entries=4, sequenceid=26736748922, filesize=1000.0
        2013-11-27 18:24:43,700 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~832.0/832, currentsize=0.0/0 for region DocSortByHour,\x00\x00\x011\xAE\xCB\xD8 \x00\x00,B\x00\x00\x05\x84\x00\x01\xE6\xAF\x00\x02\x09\x0D\x00\x00\x00\x00\x06\xB5c\xB3,1314093149893.c5e2ce4602fb62714c72cd8da3b50bd5. in 44ms, sequenceid=26736748922, compaction requested=true
        2013-11-27 18:24:43,701 INFO org.apache.hadoop.hbase.regionserver.HRegion: Starting compaction on data in region DocSortByHour,\x00\x00\x011\xAE\xCB\xD8 \x00\x00,B\x00\x00\x05\x84\x00\x01\xE6\xAF\x00\x02\x09\x0D\x00\x00\x00\x00\x06\xB5c\xB3,1314093149893.c5e2ce4602fb62714c72cd8da3b50bd5.
        2013-11-27 18:24:43,701 INFO org.apache.hadoop.hbase.regionserver.Store: Starting compaction of 3 file(s) in data of DocSortByHour,\x00\x00\x011\xAE\xCB\xD8 \x00\x00,B\x00\x00\x05\x84\x00\x01\xE6\xAF\x00\x02\x09\x0D\x00\x00\x00\x00\x06\xB5c\xB3,1314093149893.c5e2ce4602fb62714c72cd8da3b50bd5. into tmpdir=hdfs://master.uc.uuc.com:9000/hbase/DocSortByHour/c5e2ce4602fb62714c72cd8da3b50bd5/.tmp, seqid=26736748922, totalSize=47.9m
        2013-11-27 18:24:43,755 INFO org.apache.hadoop.hbase.util.FSUtils: FileSystem doesn't support getDefaultReplication
        2013-11-27 18:24:43,756 INFO org.apache.hadoop.hbase.util.FSUtils: FileSystem doesn't support getDefaultBlockSize
        2013-11-27 18:24:43,757 INFO org.apache.hadoop.hbase.regionserver.StoreFile: Delete Family Bloom filter type for hdfs://master.uc.uuc.com:9000/hbase/DocSortByHour/c5e2ce4602fb62714c72cd8da3b50bd5/.tmp/c2047a8a560645af936bcbcfe2625e5e: CompoundBloomFilterWriter
        2013-11-27 18:24:47,354 INFO org.apache.hadoop.hbase.regionserver.StoreFile: NO General Bloom and DeleteFamily was added to HFile (hdfs://master.uc.uuc.com:9000/hbase/DocSortByHour/c5e2ce4602fb62714c72cd8da3b50bd5/.tmp/c2047a8a560645af936bcbcfe2625e5e)
        2013-11-27 18:24:47,359 INFO org.apache.hadoop.hbase.regionserver.StoreFile$Reader: Loaded Delete Family Bloom (CompoundBloomFilter) metadata for c2047a8a560645af936bcbcfe2625e5e
        2013-11-27 18:24:47,359 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming compacted file at hdfs://master.uc.uuc.com:9000/hbase/DocSortByHour/c5e2ce4602fb62714c72cd8da3b50bd5/.tmp/c2047a8a560645af936bcbcfe2625e5e to hdfs://master.uc.uuc.com:9000/hbase/DocSortByHour/c5e2ce4602fb62714c72cd8da3b50bd5/data/c2047a8a560645af936bcbcfe2625e5e
        2013-11-27 18:24:47,364 INFO org.apache.hadoop.hbase.regionserver.StoreFile$Reader: Loaded Delete Family Bloom (CompoundBloomFilter) metadata for c2047a8a560645af936bcbcfe2625e5e
        2013-11-27 18:24:47,404 INFO org.apache.hadoop.hbase.regionserver.Store: Completed compaction of 3 file(s) in data of DocSortByHour,\x00\x00\x011\xAE\xCB\xD8 \x00\x00,B\x00\x00\x05\x84\x00\x01\xE6\xAF\x00\x02\x09\x0D\x00\x00\x00\x00\x06\xB5c\xB3,1314093149893.c5e2ce4602fb62714c72cd8da3b50bd5. into c2047a8a560645af936bcbcfe2625e5e, size=47.9m; total size for store is 223.8m
        2013-11-27 18:24:47,404 INFO org.apache.hadoop.hbase.regionserver.compactions.CompactionRequest: completed compaction: regionName=DocSortByHour,\x00\x00\x011\xAE\xCB\xD8 \x00\x00,B\x00\x00\x05\x84\x00\x01\xE6\xAF\x00\x02\x09\x0D\x00\x00\x00\x00\x06\xB5c\xB3,1314093149893.c5e2ce4602fb62714c72cd8da3b50bd5., storeName=data, fileCount=3, fileSize=47.9m, priority=3, time=1230521954697044; duration=3sec
        2013-11-27 18:27:07,191 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region h,!\xACU\xD2,1375542268855.ae4d6fae898f79e9fa019b78c975dd95. after a delay of 10647
        2013-11-27 18:27:17,190 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region h,!\xACU\xD2,1375542268855.ae4d6fae898f79e9fa019b78c975dd95. after a delay of 19328
        2013-11-27 18:27:17,841 INFO org.apache.hadoop.hbase.util.FSUtils: FileSystem doesn't support getDefaultReplication
        2013-11-27 18:27:17,841 INFO org.apache.hadoop.hbase.util.FSUtils: FileSystem doesn't support getDefaultBlockSize
        2013-11-27 18:27:17,851 INFO org.apache.hadoop.hbase.regionserver.StoreFile: Delete Family Bloom filter type for hdfs://master.uc.uuc.com:9000/hbase/h/ae4d6fae898f79e9fa019b78c975dd95/.tmp/135b6223aa9a41c1bf3511c7a389baea: CompoundBloomFilterWriter
        2013-11-27 18:27:17,863 INFO org.apache.hadoop.hbase.regionserver.StoreFile: NO General Bloom and NO DeleteFamily was added to HFile (hdfs://master.uc.uuc.com:9000/hbase/h/ae4d6fae898f79e9fa019b78c975dd95/.tmp/135b6223aa9a41c1bf3511c7a389baea)

        ------
        The Master worked well when it ran version 0.94.3 while the RegionServers ran version 0.94.13.

        chenjiajun added a comment -

        The error seems to disappear on my cluster after the following steps:
        1. stop the HBase cluster
        2. remove/move the ZooKeeper data
        3. start the cluster

        There is also a solution to a similar problem here: http://stackoverflow.com/questions/17792619/fatal-master-hmaster-unexpected-state-cannot-transit-it-to-offline

        Jean-Marc Spaggiari added a comment -

        I confirm. I faced the same issue yesterday. Removing the related znodes and restarting the cluster fixed it.

        Jean-Marc Spaggiari added a comment -

        Faced the issue again today.

        With zkCli:
        rmr /hbase/unassigned

        Then start HBase and:
        bin/hbase hbck -fixAssignments

        Everything back clean.

        Sergey Shelukhin added a comment -

        Well, that's a workaround. Is this pre-96 only?

        Jean-Marc Spaggiari added a comment -

        I don't know exactly where this is coming from, so I cannot say whether it's also there in >=0.96 or not. On my side I think this occurs because I'm playing with the load balancer, and I stopped the master while MANY regions were transitioning somewhere else. It might be possible to reproduce that with 0.96: start 4 nodes, assign 400 regions, shut the nodes down one by one slowly to get all the regions onto a single node, bring them all back online, balance, kill the master... and restart everything to see.

        Jean-Marc Spaggiari added a comment -

        Attached master logs while facing this issue.

        What seems suspect is the 4 calls to ClosedRegionHandler for the same region:

        2013-12-27 14:14:10,148 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 77b599963dfbc9d296ed4ecf3217e927
        2013-12-27 14:14:11,405 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 77b599963dfbc9d296ed4ecf3217e927
        2013-12-27 14:14:11,798 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 77b599963dfbc9d296ed4ecf3217e927
        2013-12-27 14:14:11,840 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 77b599963dfbc9d296ed4ecf3217e927

        Lars Hofhansl added a comment -

        Just ran into this in HBASE-10248 (investigating test failures).
        I do not really know this code well, but wouldn't it be safe to transition the ZK node from PENDING_OPEN to OFFLINE if we got here via an RS_ZK_REGION_FAILED_OPEN?
        In fact why can't we always transition from PENDING_OPEN to OFFLINE from the ClosedRegionHandler?
        PENDING_OPEN means the region is not open. Yet here we are in the ClosedRegionHandler.

        Lars Hofhansl added a comment -

        In my case I found that the AssignmentManager gets two notifications for the same region in short succession:

        2013-12-26 07:55:11,398 DEBUG [pool-1-thread-1-EventThread] zookeeper.ZooKeeperWatcher(294): master:36597-0x1432de5ff150000 Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, path=/hbase/unassigned/50d8f699ee870d7af05aa4f4b6824e8c
        2013-12-26 07:55:11,399 DEBUG [pool-1-thread-1-EventThread] zookeeper.ZKUtil(1595): master:36597-0x1432de5ff150000 Retrieved 116 byte(s) of data from znode /hbase/unassigned/50d8f699ee870d7af05aa4f4b6824e8c and set watcher; region=testRetrying,ttt,1388044498231.50d8f699ee870d7af05aa4f4b6824e8c., origin=janus.apache.org,50758,1388044485760, state=RS_ZK_REGION_FAILED_OPEN
        2013-12-26 07:55:11,399 DEBUG [pool-1-thread-1-EventThread] master.AssignmentManager(743): Handling transition=RS_ZK_REGION_FAILED_OPEN, 
        

        and then

        2013-12-26 07:55:11,401 DEBUG [pool-1-thread-1-EventThread] zookeeper.ZooKeeperWatcher(294): master:36597-0x1432de5ff150000 Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, path=/hbase/unassigned/50d8f699ee870d7af05aa4f4b6824e8c
        2013-12-26 07:55:11,401 DEBUG [MASTER_CLOSE_REGION-janus.apache.org,36597,1388044485155-0] handler.ClosedRegionHandler(92): Handling CLOSED event for 50d8f699ee870d7af05aa4f4b6824e8c
        2013-12-26 07:55:11,401 DEBUG [MASTER_CLOSE_REGION-janus.apache.org,36597,1388044485155-0] master.AssignmentManager(1665): Forcing OFFLINE; was=testRetrying,ttt,1388044498231.50d8f699ee870d7af05aa4f4b6824e8c. state=CLOSED, ts=1388044511364, server=janus.apache.org,50758,1388044485760
        2013-12-26 07:55:11,401 DEBUG [MASTER_CLOSE_REGION-janus.apache.org,36597,1388044485155-0] zookeeper.ZKAssign(264): master:36597-0x1432de5ff150000 Creating (or updating) unassigned node for 50d8f699ee870d7af05aa4f4b6824e8c with OFFLINE state
        2013-12-26 07:55:11,402 DEBUG [pool-1-thread-1-EventThread] zookeeper.ZKUtil(1595): master:36597-0x1432de5ff150000 Retrieved 116 byte(s) of data from znode /hbase/unassigned/50d8f699ee870d7af05aa4f4b6824e8c and set watcher; region=testRetrying,ttt,1388044498231.50d8f699ee870d7af05aa4f4b6824e8c., origin=janus.apache.org,50758,1388044485760, state=RS_ZK_REGION_FAILED_OPEN
        2013-12-26 07:55:11,402 DEBUG [pool-1-thread-1-EventThread] master.AssignmentManager(743): Handling transition=RS_ZK_REGION_FAILED_OPEN, server=janus.apache.org,50758,1388044485760, region=50d8f699ee870d7af05aa4f4b6824e8c
        

        Note in the 2nd set how the AM already started to react to the 1st event.

        Lars Hofhansl added a comment -

        Something like this maybe?
        It's quite hard to reproduce locally (only happened once in ~100 runs of the test), so it is very hard to be sure that this indeed fixes the issue.

        Lars Hofhansl added a comment -

        ramkrishna.s.vasudevan, wanna have another look at this, too?

        Lars Hofhansl added a comment -

        W.r.t. my last comment: I was running the test in a loop and my fix got applied by accident, so after the first run it did not fail... I'll try again to reproduce locally.

        Lars Hofhansl added a comment -

        More tests on my laptop. Without the patch, the test fails about 1 out of 5 runs. With the patch I ran the test in a loop over 150 times with no failure. It might not be the whole story, but it fixes the Master abort caused by this test.

        Lars Hofhansl added a comment -

        Of course, now that I wrote this, I got a run without the patch of over 45 successes (still running). Meh.

        Jean-Marc Spaggiari added a comment -

        But this doesn't explain why we are calling this twice. Your fix might work, but shouldn't we try to figure out why we have so many calls? I can give your fix a try in my environment if you want, and play with my balancers...

        Lars Hofhansl added a comment -

        True. It does not explain two calls.

        There is a race condition, though. Notice that in AssignmentManager, assign calls addToRegionsInTransition, which either creates a new OFFLINE RegionState or sets the RegionState to OFFLINE (unless hijack is passed), but in the actual assignment it is set back to PENDING_OPEN. This can only happen if another thread in the HMaster process interferes in between. So rechecking with the lock on the appropriate RegionState held is a good thing anyway.

        If you could put it through the wringer that'd be cool.
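        For illustration only, here is a minimal, self-contained sketch of the recheck-under-lock idea described above. It is a toy model with assumed names, not the actual 8912 patch and not the real 0.94 RegionState class:

        // Toy model (not HBase code): the state is re-read while holding the
        // per-region lock, so a concurrent transition to PENDING_OPEN is detected
        // and the stale attempt is skipped instead of throwing IllegalStateException.
        public class RegionStateModel {
          enum State { OFFLINE, PENDING_OPEN, OPENING, OPEN, CLOSED, FAILED_OPEN }

          private State state = State.CLOSED;

          synchronized void set(State s) { state = s; }

          // Returns true if the region was safely moved to OFFLINE, false if another
          // thread won the race and already started an assignment.
          synchronized boolean forceOfflineIfNotAssigning() {
            if (state == State.PENDING_OPEN || state == State.OPENING) {
              return false;
            }
            state = State.OFFLINE;
            return true;
          }
        }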

        Lars Hofhansl added a comment -

        Attaching full log from latest failed run, so it won't get lost.

        Lars Hofhansl added a comment -

        It seems what we really have to do is handle per-region ZK events completely and strictly in the order in which they are delivered.

        Lars Hofhansl added a comment -

        A per region queue would be best here, but we can't rewrite the AssignmentManager in 0.94.
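        Purely as an illustration of what such a per-region queue could look like (a generic sketch, nothing that exists in 0.94), events for one region could be pinned to a single-threaded lane so they are handled strictly in submission order:

        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;

        // Generic sketch: all events for a given encoded region name hash to the
        // same single-threaded executor, so they run in the order they arrived.
        public class RegionEventDispatcher {
          private final ExecutorService[] lanes;

          public RegionEventDispatcher(int numLanes) {
            lanes = new ExecutorService[numLanes];
            for (int i = 0; i < numLanes; i++) {
              lanes[i] = Executors.newSingleThreadExecutor();
            }
          }

          public void submit(String encodedRegionName, Runnable handler) {
            int lane = (encodedRegionName.hashCode() & Integer.MAX_VALUE) % lanes.length;
            lanes[lane].execute(handler);
          }
        }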

        Lars Hofhansl added a comment -

        It can't just be a race, though, since it happens on master restart as well.
        Jean-Marc Spaggiari, do you still have the exact state of the znodes that causes the trouble at master restart? That might be easy to capture in a unit test.

        Lars Hofhansl added a comment -

        So here's the hypothesis. A region bounces between PENDING_OPEN and FAILED_OPEN. Each time the state changes the AssignmentManager is notified, but when it reads the state it always reads the latest state (FAILED_OPEN), so it gets two notifications for FAILED_OPEN.

        I did one more test. Started HBase and created a table with COMPRESSION=>'SNAPPY'. Since I do not have SNAPPY installed the region keeps bouncing.

        Without the patch the HMaster reliably aborts every time.

        With this patch the HMaster continues to stay up, and eventually the region stops bouncing and stays in PENDING_OPEN.
        (Which means that the master eventually gives up).

        So the patch definitely fixes one of the issues! Does anybody think it will cause other issues?
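        A small, generic ZooKeeper sketch of why the duplicate FAILED_OPEN notifications can happen (the path and states here are illustrative): a NodeDataChanged watch only says that the znode changed; the handler re-reads the znode and always sees whatever the latest data is, so two quick transitions can both be read back as FAILED_OPEN.

        import org.apache.zookeeper.WatchedEvent;
        import org.apache.zookeeper.Watcher;
        import org.apache.zookeeper.ZooKeeper;
        import org.apache.zookeeper.data.Stat;

        // Generic sketch: each NodeDataChanged event triggers a fresh getData(),
        // which returns the *current* znode contents, not the value at the time of
        // the change. If the region bounced ... -> FAILED_OPEN -> OFFLINE -> FAILED_OPEN
        // quickly enough, both reads can come back as FAILED_OPEN.
        public class UnassignedNodeWatcher implements Watcher {
          private final ZooKeeper zk;

          public UnassignedNodeWatcher(ZooKeeper zk) {
            this.zk = zk;
          }

          @Override
          public void process(WatchedEvent event) {
            if (event.getType() == Event.EventType.NodeDataChanged) {
              try {
                // Re-register the watch and read whatever is in the znode right now.
                byte[] data = zk.getData(event.getPath(), this, new Stat());
                System.out.println("Observed state: " + new String(data));
              } catch (Exception e) {
                e.printStackTrace();
              }
            }
          }
        }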

        Lars Hofhansl added a comment -

        Upping to critical.

        The same effect can also be achieved by replacing the abort with an error log. setOfflineInZooKeeper would then return -1, and the caller would stop the assignment. I think for 0.94 that is the best course of action, unless somebody has a better idea. The current state of affairs is unacceptable, as it can actually lead to a cascading failure of all HMasters.
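        Schematically (a toy model only, not the actual change), the non-aborting control flow would look like this:

        // Toy model: instead of aborting the master on an unexpected state, the
        // "set offline" step reports failure and the caller gives up on this
        // assignment attempt so the region can be retried later.
        public class AssignSketch {
          static final int FAILED = -1;

          // Stand-in for setOfflineInZooKeeper(): returns a znode version on
          // success, or FAILED if the in-memory state cannot be moved to OFFLINE.
          static int setOfflineInZk(String state) {
            if (!state.equals("CLOSED") && !state.equals("OFFLINE")) {
              System.err.println("Unexpected state " + state + "; not transitioning to OFFLINE");
              return FAILED;
            }
            return 1;  // pretend this is the version of the OFFLINE znode
          }

          static void assign(String state) {
            int version = setOfflineInZk(state);
            if (version == FAILED) {
              return;  // skip this attempt instead of aborting the HMaster
            }
            System.out.println("Proceeding with assignment, znode version " + version);
          }

          public static void main(String[] args) {
            assign("PENDING_OPEN");  // logs an error and bails out
            assign("CLOSED");        // proceeds
          }
        }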

        Lars Hofhansl added a comment -

        Something like this. Tested with a snappy table again. HMaster stays up, region stops bouncing after a while.

        Jean-Marc Spaggiari added a comment -

        Sorry, I was off yesterday for the family Christmas party. The Master was down again this morning; I tried to restart it and that went well.

        I will capture the related znode state next time it goes down. It should not take that long. The culprit region this morning was from a table where Snappy is enabled.

        OK, I just retried the balancer, and it crashed instantaneously.

        Details of the culprit region in ZK:

        [zk: localhost:2181(CONNECTED) 4] get /hbase/unassigned/0b7c5e8c99e6a5f2039efed16b30ff9e
        �
         23767@node8 C?;k�~page,ac.topsgolb.sthguohtsrebruht.www http -1 /search/label/smoking%20ban null,1381226128817.0b7c5e8c99e6a5f2039efed16b30ff9e. *node8.distparser.com,60020,1388184406449
        cZxid = 0x59000abf87
        ctime = Sun Dec 29 11:41:30 EST 2013
        mZxid = 0x59000abffa
        mtime = Sun Dec 29 11:41:32 EST 2013
        pZxid = 0x59000abf87
        cversion = 0
        dataVersion = 8
        aclVersion = 0
        ephemeralOwner = 0x0
        dataLength = 198
        numChildren = 0
        

        And the logs

        2013-12-29 11:41:34,149 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 5e5bce123f3b37ca062763683383c858
        2013-12-29 11:41:34,170 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1d7aa3a9b29ee9ce2d58b1b73cfc5543
        2013-12-29 11:41:34,201 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 42fa0cb7de3ecf237bcf31b5e766f630
        2013-12-29 11:41:34,214 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 2a89feb2b068391ddf22a9066297d070
        2013-12-29 11:41:34,262 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 136e765b38663cdcc682ecff3d3e83f8
        2013-12-29 11:41:34,325 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 01188a4d5a6db06f14f03211330e5ebb
        2013-12-29 11:41:34,701 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 01188a4d5a6db06f14f03211330e5ebb
        2013-12-29 11:41:34,729 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for b75cb9067c3c4456d6198c9237c143b3
        2013-12-29 11:41:34,730 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for c576e5591b962f4b5d2850492432f95d
        2013-12-29 11:41:34,837 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for c18a27df687492cec914e85329fa17b5
        2013-12-29 11:41:34,873 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 17bee313797fc1ce982c0e31fdb6620c
        2013-12-29 11:41:35,090 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 0b7c5e8c99e6a5f2039efed16b30ff9e
        2013-12-29 11:41:35,113 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for d70b153d72679b1ee7ac6e1979df3c1f
        2013-12-29 11:41:35,114 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 0b7c5e8c99e6a5f2039efed16b30ff9e
        2013-12-29 11:41:35,187 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1d7aa3a9b29ee9ce2d58b1b73cfc5543
        2013-12-29 11:41:35,192 FATAL org.apache.hadoop.hbase.master.HMaster: Master server abort: loaded coprocessors are: []
        2013-12-29 11:41:35,193 FATAL org.apache.hadoop.hbase.master.HMaster: Unexpected state : page,ac.topsgolb.sthguohtsrebruht.www\x1Fhttp\x1F-1\x1F/search/label/smoking%20ban\x1Fnull,1381226128817.0b7c5e8c99e6a5f2039efed16b30ff9e. state=PENDING_OPEN, ts=1388335295192, server=node8.distparser.com,60020,1388184406449 .. Cannot transit it to OFFLINE.
        java.lang.IllegalStateException: Unexpected state : page,ac.topsgolb.sthguohtsrebruht.www\x1Fhttp\x1F-1\x1F/search/label/smoking%20ban\x1Fnull,1381226128817.0b7c5e8c99e6a5f2039efed16b30ff9e. state=PENDING_OPEN, ts=1388335295192, server=node8.distparser.com,60020,1388184406449 .. Cannot transit it to OFFLINE.
            at org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1890)
            at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1690)
            at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1426)
            at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1398)
            at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1393)
            at org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:105)
            at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
            at java.lang.Thread.run(Thread.java:744)
        2013-12-29 11:41:35,195 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
        2013-12-29 11:41:35,195 DEBUG org.apache.hadoop.hbase.master.HMaster: Stopping service threads
        2013-12-29 11:41:35,195 INFO org.apache.hadoop.hbase.master.HMaster$2: node3.distparser.com,60000,1388335116794-BalancerChore exiting
        2013-12-29 11:41:35,195 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60000
        2013-12-29 11:41:35,196 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 6 on 60000: exiting
        2013-12-29 11:41:35,196 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 1 on 60000: exiting
        2013-12-29 11:41:35,196 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server listener on 60000
        2013-12-29 11:41:35,196 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 8 on 60000: exiting
        2013-12-29 11:41:35,196 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 9 on 60000: exiting
        2013-12-29 11:41:35,196 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 3 on 60000: exiting
        2013-12-29 11:41:35,196 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC Server handler 1 on 60000: exiting
        2013-12-29 11:41:35,196 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 7 on 60000: exiting
        2013-12-29 11:41:35,196 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 5 on 60000: exiting
        2013-12-29 11:41:35,196 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 2 on 60000: exiting
        2013-12-29 11:41:35,196 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 4 on 60000: exiting
        2013-12-29 11:41:35,196 INFO org.apache.hadoop.hbase.master.cleaner.LogCleaner: master-node3.distparser.com,60000,1388335116794.oldLogCleaner exiting
        2013-12-29 11:41:35,196 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 0 on 60000: exiting
        2013-12-29 11:41:35,196 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC Server handler 0 on 60000: exiting
        2013-12-29 11:41:35,196 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC Server handler 2 on 60000: exiting
        2013-12-29 11:41:35,196 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server Responder
        2013-12-29 11:41:35,196 INFO org.apache.hadoop.hbase.master.cleaner.HFileCleaner: master-node3.distparser.com,60000,1388335116794.archivedHFileCleaner exiting
        2013-12-29 11:41:35,196 INFO org.apache.hadoop.hbase.master.HMaster: Stopping infoServer
        2013-12-29 11:41:35,197 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server Responder
        2013-12-29 11:41:35,197 INFO org.mortbay.log: Stopped SelectChannelConnector@0.0.0.0:60010
        2013-12-29 11:41:35,222 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 42fa0cb7de3ecf237bcf31b5e766f630
        2013-12-29 11:41:35,296 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for f1607272fcbb0dcf16d0bb96f59a9001
        2013-12-29 11:41:35,297 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 136e765b38663cdcc682ecff3d3e83f8
        2013-12-29 11:41:35,298 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x34335c5090e0483 Unable to get data of znode /hbase/unassigned/01188a4d5a6db06f14f03211330e5ebb
        java.lang.InterruptedException
            at java.lang.Object.wait(Native Method)
            at java.lang.Object.wait(Object.java:503)
            at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1309)
            at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1149)
            at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:290)
            at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:746)
            at org.apache.hadoop.hbase.zookeeper.ZKAssign.deleteNode(ZKAssign.java:488)
            at org.apache.hadoop.hbase.master.handler.OpenedRegionHandler.deleteOpenedNode(OpenedRegionHandler.java:129)
            at org.apache.hadoop.hbase.master.handler.OpenedRegionHandler.process(OpenedRegionHandler.java:103)
            at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
            at java.lang.Thread.run(Thread.java:744)
        2013-12-29 11:41:35,298 WARN org.apache.hadoop.hbase.master.AssignmentManager: Attempted to create/force node into OFFLINE state before completing assignment but failed to do so for page_proposed,\x13\x00\x00\x00\x00\x00\x00\x00ac.lligcm.www\x1Fhttp\x1F-1\x1F/study/2012-2013/faculties/science/undergraduate/courses/\x1Ffilters=language%3Aen%20ss_faculty%3AAG%20ss_faculty%3ASC%20ss_course_level%3A%22300%20level%22%20ss_course_level%3A%22200%20level%22%20ss_dept%3A0175%20ss_dept%3A0054%20ss_dept%3A0291%20ss_dept%3A0065%20ss_dept%3A0290%20ss_dept%3A0212%20ss_dept%3A0293%20ss_dept%3A0259%20ss_dept%3A0076%20sm_level%3AUndergraduate&solrsort=sort_title%20asc&BIGipServer%7ECCS_Sties%7EDRUPAL=884070532.20480.0000&BIGipServer%7ECCS_Sties%7EDRUPAL-443=867293316.47873.0000,1384353034924.f1607272fcbb0dcf16d0bb96f59a9001. state=OFFLINE, ts=1388335295296, server=null
        2013-12-29 11:41:35,299 WARN org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: The znode of the region page,pj.iravaj.www\x1Fhttp\x1F-1\x1F/landingpage/brandwomen/ref=topnav_sd_brand_s\x1Fie=UTF8&pf_rd_r=1D0TY2CPA9QH0C5C3ZED&pf_rd_m=A2EVFYIVZSJ3AR&pf_rd_t=301&pf_rd_i=0&pf_rd_p=122455449&pf_rd_s=headernav,1379104103471.01188a4d5a6db06f14f03211330e5ebb. would have already been deleted
        2013-12-29 11:41:35,299 ERROR org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: The znode of region page,pj.iravaj.www\x1Fhttp\x1F-1\x1F/landingpage/brandwomen/ref=topnav_sd_brand_s\x1Fie=UTF8&pf_rd_r=1D0TY2CPA9QH0C5C3ZED&pf_rd_m=A2EVFYIVZSJ3AR&pf_rd_t=301&pf_rd_i=0&pf_rd_p=122455449&pf_rd_s=headernav,1379104103471.01188a4d5a6db06f14f03211330e5ebb. could not be deleted.
        2013-12-29 11:41:35,299 WARN org.apache.hadoop.hbase.master.AssignmentManager: Attempted to create/force node into OFFLINE state before completing assignment but failed to do so for page,ofni.adanacniedamyllaer\x1Fhttp\x1F-1\x1F/dg/pe/auto_fuel-conversions/retirement_planning-consultants/commercial_commercial-&-industrial/cheese/pet_grooming-&-washing/golf_practice-&-driving-ranges/music_instrumental-instruction/associations/clubs/homes_special-care/\x1Fnull,1379104103471.136e765b38663cdcc682ecff3d3e83f8. state=OFFLINE, ts=1388335295297, server=null
        2013-12-29 11:41:35,308 INFO org.apache.hadoop.hbase.master.AssignmentManager$TimeoutMonitor: node3.distparser.com,60000,1388335116794.timeoutMonitor exiting
        2013-12-29 11:41:35,326 DEBUG org.apache.hadoop.hbase.catalog.CatalogTracker: Stopping catalog tracker org.apache.hadoop.hbase.catalog.CatalogTracker@4c8d0351
        2013-12-29 11:41:35,327 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region 01188a4d5a6db06f14f03211330e5ebb from server node5.distparser.com,60020,1388184401716 but region was in  the state page,pj.iravaj.www\x1Fhttp\x1F-1\x1F/landingpage/brandwomen/ref=topnav_sd_brand_s\x1Fie=UTF8&pf_rd_r=1D0TY2CPA9QH0C5C3ZED&pf_rd_m=A2EVFYIVZSJ3AR&pf_rd_t=301&pf_rd_i=0&pf_rd_p=122455449&pf_rd_s=headernav,1379104103471.01188a4d5a6db06f14f03211330e5ebb. state=OPEN, ts=1388335290361, server=node5.distparser.com,60020,1388184401716 and not in expected OFFLINE, PENDING_OPEN or OPENING states
        2013-12-29 11:41:35,333 ERROR org.apache.hadoop.hbase.executor.ExecutorService: Cannot submit [ClosedRegionHandler-node3.distparser.com,60000,1388335116794-42] because the executor is missing. Is this process shutting down?
        2013-12-29 11:41:35,335 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region 01188a4d5a6db06f14f03211330e5ebb from server node5.distparser.com,60020,1388184401716 but region was in  the state page,pj.iravaj.www\x1Fhttp\x1F-1\x1F/landingpage/brandwomen/ref=topnav_sd_brand_s\x1Fie=UTF8&pf_rd_r=1D0TY2CPA9QH0C5C3ZED&pf_rd_m=A2EVFYIVZSJ3AR&pf_rd_t=301&pf_rd_i=0&pf_rd_p=122455449&pf_rd_s=headernav,1379104103471.01188a4d5a6db06f14f03211330e5ebb. state=OPEN, ts=1388335290361, server=node5.distparser.com,60020,1388184401716 and not in expected OFFLINE, PENDING_OPEN or OPENING states
        2013-12-29 11:41:35,335 ERROR org.apache.hadoop.hbase.executor.ExecutorService: Cannot submit [ClosedRegionHandler-node3.distparser.com,60000,1388335116794-43] because the executor is missing. Is this process shutting down?
        2013-12-29 11:41:35,336 ERROR org.apache.hadoop.hbase.executor.ExecutorService: Cannot submit [ClosedRegionHandler-node3.distparser.com,60000,1388335116794-44] because the executor is missing. Is this process shutting down?
        2013-12-29 11:41:35,336 ERROR org.apache.hadoop.hbase.executor.ExecutorService: Cannot submit [ClosedRegionHandler-node3.distparser.com,60000,1388335116794-45] because the executor is missing. Is this process shutting down?
        2013-12-29 11:41:35,337 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region 01188a4d5a6db06f14f03211330e5ebb from server node5.distparser.com,60020,1388184401716 but region was in  the state page,pj.iravaj.www\x1Fhttp\x1F-1\x1F/landingpage/brandwomen/ref=topnav_sd_brand_s\x1Fie=UTF8&pf_rd_r=1D0TY2CPA9QH0C5C3ZED&pf_rd_m=A2EVFYIVZSJ3AR&pf_rd_t=301&pf_rd_i=0&pf_rd_p=122455449&pf_rd_s=headernav,1379104103471.01188a4d5a6db06f14f03211330e5ebb. state=OPEN, ts=1388335290361, server=node5.distparser.com,60020,1388184401716 and not in expected OFFLINE, PENDING_OPEN or OPENING states
        2013-12-29 11:41:35,337 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received FAILED_OPEN for region b75cb9067c3c4456d6198c9237c143b3 from server node4.distparser.com,60020,1388184407264 but region was in  the state page,rf.idua.www\x1Fhttp\x1F-1\x1F/fr/brand/fr/audi_fleet_solutions/contact/contact_transport_personnes.html\x1Fnull,1379103792232.b75cb9067c3c4456d6198c9237c143b3. state=CLOSED, ts=1388335295034, server=node4.distparser.com,60020,1388184407264 and not in OFFLINE, PENDING_OPEN or OPENING
        2013-12-29 11:41:35,339 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received FAILED_OPEN for region c576e5591b962f4b5d2850492432f95d from server node4.distparser.com,60020,1388184407264 but region was in  the state page,ten.yrotceridsrevomsrekcap.www\x1Fhttp\x1F-1\x1F/Packers_Movers_kanpur.html\x1Fnull,1379081871633.c576e5591b962f4b5d2850492432f95d. state=CLOSED, ts=1388335295151, server=node4.distparser.com,60020,1388184407264 and not in OFFLINE, PENDING_OPEN or OPENING
        2013-12-29 11:41:35,414 ERROR org.apache.hadoop.hbase.executor.ExecutorService: Cannot submit [OpenedRegionHandler-node3.distparser.com,60000,1388335116794-46] because the executor is missing. Is this process shutting down?
        2013-12-29 11:41:35,426 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Closed zookeeper sessionid=0x24262655e121196
        2013-12-29 11:41:35,429 ERROR org.apache.hadoop.hbase.executor.ExecutorService: Cannot submit [OpenedRegionHandler-node3.distparser.com,60000,1388335116794-47] because the executor is missing. Is this process shutting down?
        2013-12-29 11:41:35,448 INFO org.apache.hadoop.hbase.master.AssignmentManager$TimerUpdater: node3.distparser.com,60000,1388335116794.timerUpdater exiting
        2013-12-29 11:41:35,448 INFO org.apache.hadoop.hbase.master.SplitLogManager$TimeoutMonitor: node3.distparser.com,60000,1388335116794.splitLogManagerTimeoutMonitor exiting
        2013-12-29 11:41:35,460 INFO org.apache.hadoop.hbase.master.HMaster: HMaster main thread exiting
        2013-12-29 11:41:35,460 ERROR org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
        java.lang.RuntimeException: HMaster Aborted
            at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:160)
            at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:104)
            at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
            at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:76)
            at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2120)
        

        Again, the same region is handled multiple times by ClosedRegionHandler...
        I will apply your patch on my master and retry.

        stack added a comment -

        So, with this patch, is it the timeout monitor thread that effects the repair and gets us going again? That would be better than a crash. Patch looks safe enough. It continues to let the assign fail; it just leaves fixup to the TimeoutMonitor for this one case. Jimmy Xiang What you think boss?

        Jean-Marc Spaggiari added a comment -

        I tried the patch, and I think it just moved the issue further along.

        First, I restored the default balancer to get normal behaviour.

        2013-12-29 13:20:24,408 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server node1.domain.com,60020,1388341141398: Exception refreshing OPENING; region=87dc596f763bd1b43a63c4afd93e4f00, context=post_region_open
        org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/unassigned/87dc596f763bd1b43a63c4afd93e4f00
            at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
            at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
            at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
            at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:349)
            at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:848)
            at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:811)
            at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:747)
            at org.apache.hadoop.hbase.zookeeper.ZKAssign.retransitionNodeOpening(ZKAssign.java:674)
            at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.tickleOpening(OpenRegionHandler.java:380)
            at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:108)
            at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
            at java.lang.Thread.run(Thread.java:744)
        2013-12-29 13:20:24,413 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
        2013-12-29 13:20:24,420 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed refreshing OPENING; region=87dc596f763bd1b43a63c4afd93e4f00, context=post_region_open
        2013-12-29 13:20:24,421 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x1427652a35a108f Attempt to transition the unassigned node for 404a7ac95dc8ce89826206453c501e2a from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the node existed and was in the expected state but then when setting data we got a version mismatch
        2013-12-29 13:20:24,423 INFO org.mortbay.log: Stopped SelectChannelConnector@0.0.0.0:60030
        2013-12-29 13:20:24,434 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x1427652a35a108f Attempt to transition the unassigned node for 87dc596f763bd1b43a63c4afd93e4f00 from RS_ZK_REGION_OPENING to RS_ZK_REGION_FAILED_OPEN failed, the node existed but was in the state M_ZK_REGION_OFFLINE set by the server node1.domain.com,60020,1388341141398
        2013-12-29 13:20:24,435 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Unable to mark region {NAME => 'page,moc.krowtenrehtaeweht.www\x1Fhttp\x1F-1\x1F/gardening/cask0109\x1Fnull,1379303806726.87dc596f763bd1b43a63c4afd93e4f00.', STARTKEY => 'moc.krowtenrehtaeweht.www\x1Fhttp\x1F-1\x1F/gardening/cask0109\x1Fnull', ENDKEY => 'moc.nuhc9.iahgnahs\x1Fhttp\x1F-1\x1F/travels/23865/\x1Fnull', ENCODED => 87dc596f763bd1b43a63c4afd93e4f00,} as FAILED_OPEN. It's likely that the master already timed out this open attempt, and thus another RS already has the region.
        2013-12-29 13:20:24,435 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_RS_OPEN_REGION
        java.io.IOException: Aborting flush because server is abortted...
            at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1556)
            at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1539)
            at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1034)
            at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:982)
            at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:947)
            at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.cleanupFailedOpen(OpenRegionHandler.java:365)
            at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:115)
            at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
            at java.lang.Thread.run(Thread.java:744)
        

        It crashed on a region server.

        I stopped the cluster, restarted it, and then I got one region pending transition for more than 5 minutes.

        2013-12-29 13:22:37,716 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x34335c5090e04bb Attempt to transition the unassigned node for 75c96fb5c15793e04fb71d553a51619b from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING failed, the node existed but was version 7 not the expected version 6
        2013-12-29 13:22:37,716 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed refreshing OPENING; region=75c96fb5c15793e04fb71d553a51619b, context=post_region_open
        2013-12-29 13:22:37,749 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x34335c5090e04bb Attempt to transition the unassigned node for 75c96fb5c15793e04fb71d553a51619b from RS_ZK_REGION_OPENING to RS_ZK_REGION_FAILED_OPEN failed, the node existed but was in the state M_ZK_REGION_OFFLINE set by the server node1.domain.com,60020,1388341328265
        2013-12-29 13:22:37,751 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Unable to mark region {NAME => 'page,ac.edudlicnep.www\x1Fhttp\x1F-1\x1F/s/ref=sr_nr_p_6_4\x1Frh=n%3A1064954%2Ck%3AArt+Supplies%2Cp_6%3AA22378Z03K0GID&bbn=1064954&keywords=Art+Supplies&ie=UTF8&qid=1343415953&rnid=331539011,1384385444837.75c96fb5c15793e04fb71d553a51619b.', STARTKEY => 'ac.edudlicnep.www\x1Fhttp\x1F-1\x1F/s/ref=sr_nr_p_6_4\x1Frh=n%3A1064954%2Ck%3AArt+Supplies%2Cp_6%3AA22378Z03K0GID&bbn=1064954&keywords=Art+Supplies&ie=UTF8&qid=1343415953&rnid=331539011', ENDKEY => 'ac.efilthgin\x1Fhttp\x1F-1\x1F/directory/all/all/all-virtuelle+four-bois+sport+piano+ecrans-geants+europeen+sandwichs+bar-etudiant+desserts+bluegrass+open-bar+jam\x1Fnull', ENCODED => 75c96fb5c15793e04fb71d553a51619b,} as FAILED_OPEN. It's likely that the master already timed out this open attempt, and thus another RS already has the region.
        

        Then I stopped the master again, and this time it went well.

        So just to test, with the default balancer, I tried to balance again and again, every 3 minutes or so to give it a breather between two balancing runs, and again I got a region stuck in transition.

        Lars Hofhansl added a comment -

        Let me summarize what I found:

        1. When a region server attempts to open a region and fails, it takes the respective znode to PENDING_OPEN followed by FAILED_OPEN in quick succession.
        2. The HMaster thus gets two notifications from ZK.
        3. If the znode transitioned to FAILED_OPEN before the HMaster could react to PENDING_OPEN, it will have two concurrent threads that read FAILED_OPEN from the znode. (Note that AssignmentManager.handleRegion spawns a new thread to run the ClosedRegionHandler upon FAILED_OPEN.)
        4. Now the HMaster tries to concurrently assign the same region twice.

        Hence the task is to (a) allow only one outstanding operation per region, (b) get out of the way on the 2nd attempt, or (c) not react twice to the same version of the same znode.
        (a) and (c) are too risky a change in 0.94, I think, even though they would be the cleaner avenue.
        (b) is what both of my suggested patches do: let the first assignment attempt continue and stop the 2nd (see the sketch below).
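        Roughly, option (b) amounts to the following (a hedged sketch with made-up names, not the actual 8912-0.94-alt2.txt patch): when the master is about to force the znode OFFLINE for a new assign and finds the region already PENDING_OPEN, it backs off instead of throwing the IllegalStateException that aborts the master.

            // Hedged sketch only; the real check lives in AssignmentManager.setOfflineInZooKeeper.
            class AssignBackoffSketch {
                enum RegionAssignState { OFFLINE, PENDING_OPEN, OPENING, OPEN, CLOSED, FAILED_OPEN }

                /** Returns true if the caller may force the znode OFFLINE and proceed with the assign. */
                static boolean mayForceOfflineForAssign(RegionAssignState current) {
                    if (current == RegionAssignState.PENDING_OPEN) {
                        // A previous assign of this region is still in flight; back off and let it
                        // (or the TimeoutMonitor) finish the job instead of aborting the master.
                        return false;
                    }
                    return true;
                }
            }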

        Jean-Marc Spaggiari,
        In RecoverableZooKeeper.setData(...) I see a specific check for BADVERSION. Are you absolutely sure you're running a consistent version of HBase (and the latest 0.94)?

        This looks like a different (also bad) issue to me. Maybe masked by the master dying earlier, but I doubt it.

        stack,

        So, with this patch, is it the timeout monitor thread that effects the repair and gets us going again? That would be better than a crash.

        The scenario I found is that we tried to assign the same region twice concurrently. Not aborting and not continuing with the assignment when we detect this condition lets the first assignment go through.

        Without a rewrite this is as good as we can do in 0.94. We can't just take the node to OFFLINE, as that would interfere with the first assignment attempt, which is still in progress. As I said above, we should really only allow one operation at a time for a given region, but that would be a larger rewrite of a hairy piece of the code... Not for 0.94.

        I would like to commit the alt2 patch here and continue to investigate.

        Jean-Marc Spaggiari added a comment -

        I found one node running 0.94.3: node 8. But the version exception was thrown by node1. I checked node1 with the web UI and the jars, and it's all 0.94.15. The master (node3) is running the latest 0.94 branch with this patch applied. All the other nodes are running 0.94.15RC.

        I have updated Node8 to 0.94.15RC and will retry.

        Regarding the patch, I'm +1 on it since it will allow the master to start, even if it doesn't fix the issue completely. But it seems that, as you said, there is more to do.

        Jimmy Xiang added a comment -

        Without a rewrite this is as good as we can do in 0.94

        Agreed. +1 on alt2 patch.

        Lars Hofhansl added a comment -

        Filed a subtask; will commit there soon.

        Lars Hofhansl added a comment -

        I did a "binary search" through the releases. Looks like this was introduced between 0.94.5 and 0.94.6.1. Looking through the issues now.

        In 0.94.5 and before, a table with a forever bouncing region can be disabled and dropped. Even with my patch in HBASE-10257 that is not possible (the drop will hang).

        Lars Hofhansl added a comment -

        Suspects:

        • HBASE-8040 Race condition in AM after HBASE-7521 (only 0.94)
        • HBASE-7521 fix HBASE-6060 (regions stuck in opening state) in 0.94

        Sergey Shelukhin, mind having a look as well?

        Lars Hofhansl added a comment -

        I am 99% sure that HBASE-7521 is the culprit; unfortunately it is a pretty huge change (I had -0'd it, incidentally).
        So ramkrishna.s.vasudevan, Rajesh Babu Devabhaktuni, Jimmy Xiang, please have another look as well. This is really bad!

        ramkrishna.s.vasudevan added a comment -

        Will take a look once again. In fact, I checked the change that you made. Initially we had decided to do that change but always avoided it, just because in some cases we were getting double assignments.
        Thanks Lars.

        Lars Hofhansl added a comment -

        There are layers of races here, it seems.
        With my "fix" not to abort the master, a RegionServer can now get a new request to open a region while it is attempting to transition the current attempt to FAILED_OPEN in ZK. It then ignores that concurrent open request, and thus the region will forever be stuck in PENDING_OPEN (unless the timeout monitor kicks in, of course). The only reason that did not happen before was that the master would just abort before it could reassign the region a second time.
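        The server-side behaviour described above looks roughly like this (a hedged sketch with hypothetical names, not the actual HRegionServer code): a second open request for a region that is still registered as in transition is simply dropped, so the master never hears back and the region stays PENDING_OPEN until the TimeoutMonitor intervenes.

            import java.util.concurrent.ConcurrentHashMap;
            import java.util.concurrent.ConcurrentMap;

            // Hedged sketch of why a re-assign to the same server can be silently ignored.
            class OpenRequestGuardSketch {
                private final ConcurrentMap<String, Boolean> regionsInTransitionInRS =
                        new ConcurrentHashMap<String, Boolean>();

                boolean openRegion(String encodedName) {
                    if (regionsInTransitionInRS.putIfAbsent(encodedName, Boolean.TRUE) != null) {
                        // The previous (failing) attempt is still registered, so the new open
                        // request is dropped and the master-side state stays PENDING_OPEN.
                        return false;
                    }
                    // ... otherwise an OpenRegionHandler would be scheduled for the region ...
                    return true;
                }
            }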

        When I fix that (by taking the region out of RITs before the znode is transitioned) there is an NPE when trying to remove the HRI from the RITs (which was hidden by missing the concurrent assign request before).

        Terrible. I hope this is better in trunk after the AM rewrite. But since this issue stems from backported code, I doubt it.

        Lars Hofhansl added a comment -

        OK... The problem was that HRegionServer.removeFromRegionsInTransition does not deal with the case when the region is not in regionsInTransitionInRS (it returns boolean, but the result of the remove(...) is a Boolean, which could be null).

        With that I have good behavior in all cases - that I can reproduce - now. As in 0.94.5 the Master/RS continue to attempt to reassign the region (which always fails in my case because I do not have snappy). So far it did that 1000's of times and did not fail. If anything we might want to rate limit attempts to assign the same region over and over again.

        Will post a patch once SVN is back up.
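        To make that pitfall concrete, here is a minimal, self-contained sketch (a hypothetical class, not the actual HRegionServer code): ConcurrentMap.remove() returns the previous value or null, so auto-unboxing the result into a primitive boolean throws a NullPointerException exactly when the region is not in regionsInTransitionInRS.

            import java.util.concurrent.ConcurrentHashMap;
            import java.util.concurrent.ConcurrentMap;

            // Hypothetical stand-in for the server-side RIT bookkeeping discussed above.
            class RitRemoveSketch {
                private final ConcurrentMap<String, Boolean> regionsInTransitionInRS =
                        new ConcurrentHashMap<String, Boolean>();

                // Broken: remove() returns null for an absent key, and unboxing null throws an NPE.
                boolean removeFromRegionsInTransitionBroken(String encodedName) {
                    return regionsInTransitionInRS.remove(encodedName);
                }

                // Fixed: handle the "region was not in transition" case explicitly.
                boolean removeFromRegionsInTransitionSafe(String encodedName) {
                    Boolean previous = regionsInTransitionInRS.remove(encodedName);
                    return previous != null && previous.booleanValue();
                }
            }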

        Lars Hofhansl added a comment -

        8912-fix-race fixes all the races I could detect (from looking at the code and running real tests).
        This should be good to commit.

        Jimmy Xiang added a comment -

        As I said in my comments for HBASE-7521, there are many synchronization issues in the 0.94 AM, so it (AM#0.94) has problems handling races properly. We cannot avoid those races; we have to deal with them and make sure the internal states won't get messed up. In trunk we have better synchronization, so it (AM#trunk) should be better. Jean-Marc Spaggiari, could you run your test on 0.96 and let us know if you can reproduce this on trunk? We run IT with CM all the time and could not see such issues on 0.96/trunk.

        Lars Hofhansl added a comment - - edited

        Thanks Jimmy Xiang. Are you good with the -fix-races patch I posted here? It fixes all issues that I can detect.

        Jimmy Xiang added a comment -

        For the fix-races patch, I understand the change to HRegionServer#removeFromRegionsInTransition. For the OpenRegionHandler change, we do call this.rsServices.removeFromRegionsInTransition(this.regionInfo) in the finally block. I was wondering how the change will help. It should help if the master tries to assign the region to the same host again, which is very common in unit tests. However, if we remove the region from the transition list before we change the znode and another openRegion call gets to this server now, it could see the wrong znode state. This is unlikely to happen (the master assigns the same region using the same znode version). However, in the finally block we could remove the region from transition by mistake (the new transition is removed instead). Will this cause any issue?

        Jimmy Xiang added a comment -

        It seems to me we replace one race with a new one (OpenRegionHandler). If the new one happens less often, that is good with me.

        Lars Hofhansl added a comment -

        Thanks Jimmy Xiang. The idea was that we remove the region from RITs before we transition the znode. So there should be no race anymore, since the master cannot know about the FAILED_OPEN before the region was removed from RITs, and since the master also avoids concurrent assigns of the same region now.

        Cool. I'll wait for an hour or two and then commit and resolve.
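        A hedged sketch of that ordering (illustrative names, not the committed patch): the region leaves the server-side RIT bookkeeping before FAILED_OPEN becomes visible in ZooKeeper, so by the time the master reacts and re-assigns, this server no longer rejects the new open request as a duplicate.

            // Hedged sketch of the "remove from RIT before transitioning the znode" ordering.
            class FailedOpenOrderingSketch {
                private final java.util.concurrent.ConcurrentMap<String, Boolean> regionsInTransitionInRS =
                        new java.util.concurrent.ConcurrentHashMap<String, Boolean>();

                void onOpenFailed(String encodedName) {
                    // 1. Stop tracking the region locally first, so a re-assign to this same
                    //    server is not ignored as "already in transition".
                    regionsInTransitionInRS.remove(encodedName);
                    // 2. Only then publish FAILED_OPEN; the master cannot react to it before
                    //    step 1 has completed.
                    publishFailedOpenInZooKeeper(encodedName);
                }

                private void publishFailedOpenInZooKeeper(String encodedName) {
                    // Stand-in for the real ZKAssign transition to RS_ZK_REGION_FAILED_OPEN.
                }
            }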

        Lars Hofhansl added a comment -

        Committed this to 0.94

        Hudson added a comment -

        SUCCESS: Integrated in HBase-0.94-security #377 (See https://builds.apache.org/job/HBase-0.94-security/377/)
        HBASE-8912 [0.94] AssignmentManager throws IllegalStateException from PENDING_OPEN to OFFLINE (larsh: rev 1554933)

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
        Hudson added a comment -

        FAILURE: Integrated in HBase-0.94-JDK7 #7 (See https://builds.apache.org/job/HBase-0.94-JDK7/7/)
        HBASE-8912 [0.94] AssignmentManager throws IllegalStateException from PENDING_OPEN to OFFLINE (larsh: rev 1554933)

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
        Hudson added a comment -

        ABORTED: Integrated in HBase-0.94 #1245 (See https://builds.apache.org/job/HBase-0.94/1245/)
        HBASE-8912 [0.94] AssignmentManager throws IllegalStateException from PENDING_OPEN to OFFLINE (larsh: rev 1554933)

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
        Liang Xie added a comment -

        Jimmy Xiang

        As I said in my comments for HBASE-7521, there are many synchronization issues in the 0.94 AM, so it (AM#0.94) has problems handling races properly. We cannot avoid those races; we have to deal with them and make sure the internal states won't get messed up. In trunk we have better synchronization, so it (AM#trunk) should be better. Jean-Marc Spaggiari, could you run your test on 0.96 and let us know if you can reproduce this on trunk? We run IT with CM all the time and could not see such issues on 0.96/trunk.

        Do you have any advice or suggestions to avoid or reduce this risk for a cluster operator on the latest 0.94 branches? I ask this because we (Xiaomi) have been using the 0.94.3 branch for almost one year without any serious AM issue in production. We plan to upgrade to 0.94.1x this month, and after looking into the current issue it makes me worry a little now...

        Lars Hofhansl added a comment -

        The race at issue here was introduced in 0.94.6. No doubt there are other races, which led to the rewrite of the AM in 0.96.

        ramkrishna.s.vasudevan added a comment -

        I checked out the fix-race patch. I think removing the region from the transition map may trigger another assignment. Is it possible to handle this on the master side only?

        ramkrishna.s.vasudevan added a comment -

        Am attaching a patch here. It makes FAILED_OPEN be processed in the same thread and just adds a set with synchronization so that multiple threads cannot change the RIT state in memory concurrently. In cases where SNAPPY is not installed on one RS and there is more than one RS where SNAPPY is installed, this should work. (Am describing this as a test scenario.)
        I have not run the UTs.
        Lars Hofhansl, Jimmy Xiang
        Let me know what you think. Even if we leave the ClosedRegionHandler to operate in a separate thread it should be OK, I think.
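        A minimal sketch of the guard being described (using the failedOpenRegions name referenced later in this thread; the rest is illustrative, not the attached patch): FAILED_OPEN for a region is only processed by the thread that first managed to add it to the set, and the entry is cleared in a finally block.

            import java.util.Collections;
            import java.util.HashSet;
            import java.util.Set;

            // Hedged sketch of serializing FAILED_OPEN handling per region.
            class FailedOpenGuardSketch {
                private final Set<String> failedOpenRegions =
                        Collections.synchronizedSet(new HashSet<String>());

                void handleFailedOpen(String encodedName) {
                    if (!failedOpenRegions.add(encodedName)) {
                        // Another thread is already handling FAILED_OPEN for this region;
                        // do not touch the in-memory RIT state a second time.
                        return;
                    }
                    try {
                        // ... update the RIT state and trigger the ClosedRegionHandler ...
                    } finally {
                        failedOpenRegions.remove(encodedName);
                    }
                }
            }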

        ramkrishna.s.vasudevan added a comment -

        The change is small but since try/finally is introduced it is showing as a big one.

        Jimmy Xiang added a comment -

        Liang Xie, have you considered upgrading to 0.96.1? IT with CM shows it's very stable.

        Jimmy Xiang added a comment -

        ramkrishna.s.vasudevan, my only concern with using the same thread is that the AM has only one ZK event thread in 0.94 (in trunk we have a pool of threads).

        ramkrishna.s.vasudevan added a comment -
          this.executorService.submit(new ClosedRegionHandler(master,
                      this, regionState.getRegion()));
        

        So would bringing back this change and keeping the rest of the patch work?

        Lars Hofhansl added a comment -

        I had been considering this, but we'd block the event thread for a long time. The best solution IMHO is to only allow one operation per region at a time, but that is hard to fit in (I think) without rewriting most of the logic.

        If you bring back running the ClosedRegionHandler asynchronously you'd have the same issues, right? (the region will be removed from failedOpenRegions in the finally block, while the ClosedRegionHandler is still running - the ClosedRegionHandler might not even have started)

        Actually what is the issue? What race condition are you referring to ramkrishna.s.vasudevan?

        Jean-Marc Spaggiari added a comment -

        As asked, I tried to play a bit with 0.96.1.1.

        I tried to do major balancing. Killed some servers to get all the regions assigned to a single server, restarted the other servers and ran the balancer. Then killed all servers except one again. Stopped HBase, restarted all the servers, got the regions balanced again. Also ran some major compactions to make sure regions are moving in Hadoop too.

        The only thing is that upon restart all regions are rebalanced when I would have expected them to stay where they were assigned before, but that is not related to this issue.

        Conclusion: Seems to not affect 0.96.1.1

        Jimmy Xiang added a comment -

        Jean-Marc Spaggiari, thanks a lot!

        Lars Hofhansl added a comment -

        Jean-Marc Spaggiari, Cool! What happens in 0.94 with this and HBASE-10257 applied?

        Jean-Marc Spaggiari added a comment -

        I will rebuild, deploy, test and let you know right away. I will take the latest 0.94 branch and apply both HBASE-8912 and HBASE-10257...

        Jean-Marc Spaggiari added a comment - - edited

        After the first restart, 36 regions were stuck in transition, but no server crashed.

        What I did:

        • Restored the default balancer to make sure as many regions as possible will move.
        • Stop/start HBase.
        • Run the balancer from the shell.

        Everything is back up after a 2nd restart.

        I got many errors like this one:

        2014-01-03 16:03:03,958 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received FAILED_OPEN for region b75cb9067c3c4456d6198c9237c143b3 from server node4.domain.com,60020,1388782921790 but region was in  the state page,rf.idua.www\x1Fhttp\x1F-1\x1F/fr/brand/fr/audi_fleet_solutions/contact/contact_transport_personnes.html\x1Fnull,1379103792232.b75cb9067c3c4456d6198c9237c143b3. state=CLOSED, ts=1388782983373, server=node4.domain.com,60020,1388782921790 and not in OFFLINE, PENDING_OPEN or OPENING
        

        After investigating, I figured out that snappy was missing on a server. I fixed that and restarted: all seems to be fine. So I restored my customized balancer, restarted, and balanced.

        Still some warnings in the logs:

        2014-01-03 16:21:52,864 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region db8e67acde26bf340da481d3c1b934cd from server node4.domain.com,60020,1388784051197 but region was in  the state page,moc.tenretnigruoboc.www\x1Fhttp\x1F-1\x1F/cobourg-and-the-web\x1Fnull,1379103844627.db8e67acde26bf340da481d3c1b934cd. state=OPEN, ts=1388784100392, server=node4.domain.com,60020,1388784051197 and not in expected OFFLINE, PENDING_OPEN or OPENING states
        

        But this time all the regions were assigned correctly.

        I did that one more time (change balancer, stop, start, balance; change balancer, stop, start, balance). I turned the log level to WARN.

        2014-01-03 16:28:51,142 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region 17bee313797fc1ce982c0e31fdb6620c from server node8.domain.com,60020,1388784498327 but region was in  the state page,rf.ofniecnarf.www\x1Fhttp\x1F-1\x1F/vote/comment/27996/1/vote/zero_vote/c99b0992e5a9cd6bf3a4cfc91769ceeb\x1Fnull,1379104524006.17bee313797fc1ce982c0e31fdb6620c. state=OPEN, ts=1388784531048, server=node8.domain.com,60020,1388784498327 and not in expected OFFLINE, PENDING_OPEN or OPENING states
        2014-01-03 16:28:52,135 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region 6dc6290df1855b319f60bf89faa3da41 from server node8.domain.com,60020,1388784498327 but region was in  the state page_crc,\x00\x00\x00\x00\xD7\xD9\x97\x8Bvideo.k-wreview.ca,1378042601904.6dc6290df1855b319f60bf89faa3da41. state=OPEN, ts=1388784531793, server=node8.domain.com,60020,1388784498327 and not in expected OFFLINE, PENDING_OPEN or OPENING states
        2014-01-03 16:28:52,712 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region ec4f96b6cedd935aeba279b15d5337af from server node8.domain.com,60020,1388784498327 but region was in  the state work_proposed,\x98\xBF\xAF\x90\x00\x00\x00\x00http://feedproxy.google.com/~r/WheatWeeds/~3/Of24fZKcpco/the-eighth-day-of-christmas.html,1378975430143.ec4f96b6cedd935aeba279b15d5337af. state=OPEN, ts=1388784532540, server=node8.domain.com,60020,1388784498327 and not in expected OFFLINE, PENDING_OPEN or OPENING states
        2014-01-03 16:28:52,747 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region 4f823b5de664556a89cbd86aa41cd0b0 from server node8.domain.com,60020,1388784498327 but region was in  the state work_proposed,\x8D4K\xEA\x00\x00\x00\x00http://twitter.com/home?status=CartoonStock%3A++http%3A%2F%2Fwww%2Ecartoonstock%2Ecom%2Fdirectory%2Fc%2Fcream%5Ftea%5Fgifts%2Easp,1378681682935.4f823b5de664556a89cbd86aa41cd0b0. state=OPEN, ts=1388784532552, server=node8.domain.com,60020,1388784498327 and not in expected OFFLINE, PENDING_OPEN or OPENING states
        2014-01-03 16:28:53,244 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region da0bd0a6b7187f731fb34d4ac14ca279 from server node8.domain.com,60020,1388784498327 but region was in  the state work_proposed,\xB2\xE6\xB6\xBB\x00\x00\x00\x00http://www.canpages.ca/page/QC/notre-dame-des-prairies/concept-beton-design/4550984.html,1378737981443.da0bd0a6b7187f731fb34d4ac14ca279. state=OPEN, ts=1388784533203, server=node8.domain.com,60020,1388784498327 and not in expected OFFLINE, PENDING_OPEN or OPENING states
        

        But everything finally got assigned without any restart required, and pretty quickly.

        Logs from the last run:

        2014-01-03 16:32:20,252 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":10969,"call":"balance(), rpc version=1, client version=29, methodsFingerPrint=1886733559","client":"192.168.23.7:54614","starttimems":1388784729247,"queuetimems":0,"class":"HMaster","responsesize":0,"method":"balance"}
        2014-01-03 16:32:21,278 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region 043d45cada6185d86e743754957e579a from server node1.domain.com,60020,1388784692832 but region was in  the state page,moc.yubreffotseb.www\x1Fhttp\x1F-1\x1F/camera-bags-cases-straps-camera-bags-cases-c-282_888_580.html\x1Fzenid=ji3nr2ps8rnbaa7joc0lv4qln2,1388782516646.043d45cada6185d86e743754957e579a. state=OPEN, ts=1388784735731, server=node1.domain.com,60020,1388784692832 and not in expected OFFLINE, PENDING_OPEN or OPENING states
        2014-01-03 16:32:21,713 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region e097f829b6eafb70b30c254fd4af662c from server node1.domain.com,60020,1388784692832 but region was in  the state page,ac.usneeuq.ssenisub\x1Fhttp\x1F-1\x1F/grad_studies/PHD/about_us/queens_leaders_forum/about_us/about_us/grad_studies/PHD/student_career_services/queens_leaders_forum/recruiting/news/recruiting/about_us/accreditations.php\x1Fnull,1383168138496.e097f829b6eafb70b30c254fd4af662c. state=OPEN, ts=1388784736528, server=node1.domain.com,60020,1388784692832 and not in expected OFFLINE, PENDING_OPEN or OPENING states
        2014-01-03 16:32:26,862 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region c94705f4a7c23a0a05a01bfe9d7755bc from server node1.domain.com,60020,1388784692832 but region was in  the state entry,christian_labelle,1377000858428.c94705f4a7c23a0a05a01bfe9d7755bc. state=OPEN, ts=1388784740724, server=node1.domain.com,60020,1388784692832 and not in expected OFFLINE, PENDING_OPEN or OPENING states
        2014-01-03 16:32:34,516 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region a8d60db86bd03cfbfba0bae0bb3cb564 from server node1.domain.com,60020,1388784692832 but region was in  the state work_proposed,W{8\xB2\x00\x00\x00\x00http://www.prairiesouth.ca/williamgrayson/calendar-mainmenu-26/day.listevents/2013/10/29/23.html,1383415634227.a8d60db86bd03cfbfba0bae0bb3cb564. state=OPEN, ts=1388784754264, server=node1.domain.com,60020,1388784692832 and not in expected OFFLINE, PENDING_OPEN or OPENING states
        

        So overall, it's WAY more stable! I have not been able to get anything stuck or crashed with the 2 patches applied. I will keep them. Big +1 from me. Thanks for fixing that. I think it might be easy to fix the last few remaining warnings...

        Lars Hofhansl added a comment -

        Cool, thanks JM! The warnings are expected to some extent.
        The 36 regions that got stuck after the first restart: were they assigned to the RS that had SNAPPY missing?

        Jean-Marc Spaggiari added a comment -

        Yep, all 36 regions belonged to the server with Snappy issues. It would have been nice to see them re-assigned successfully somewhere else. I can reproduce that easily; I just need to remove the Snappy .so file...

        Also, regarding the warning, stripped down to its essentials it reads like this:
        2014-01-03 16:32:21,278 WARN AssignmentManager: Received OPENED for region 0...a from server node1 but region was in the state state=OPEN, and not in expected OFFLINE, PENDING_OPEN or OPENING states

        If we get OPENED and the region is OPEN, can we not simply discard the warning? That means the region is fine: we got a notification that it opened, and it is already done. So why should we worry?
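
        A self-contained sketch of the check being suggested (simplified state model and invented names, not the actual 0.94 handler code): if the master already has the region OPEN on the very server that reports OPENED, the event could be treated as a benign duplicate rather than a warning.

        // Hypothetical sketch -- simplified model, not the real AssignmentManager classes.
        class OpenedEventCheck {
          enum State { OFFLINE, PENDING_OPEN, OPENING, OPEN, CLOSED }

          /** True when an OPENED event can be silently ignored: the master already
           *  considers the region OPEN on the same server that sent the event. */
          static boolean isBenignDuplicate(State current, String currentServer,
              String reportingServer) {
            return current == State.OPEN
                && currentServer != null
                && currentServer.equals(reportingServer);
          }
        }
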

        Lars Hofhansl added a comment -

        Yeah, it would be nice if the AM could retain a history of assignments for a region and avoid retrying the same RS over and over; it should also do per-region rate limiting. Too risky to add this to 0.94, though.
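
        Just to make the idea concrete, a hedged sketch of such a history (invented names, not something proposed for 0.94): remember which servers a region already failed to open on, and throttle repeated assign attempts per region.

        // Hypothetical sketch only -- illustrates the idea, not an actual 0.94 patch.
        import java.util.HashMap;
        import java.util.HashSet;
        import java.util.Map;
        import java.util.Set;

        class AssignmentHistory {
          private static final long MIN_RETRY_INTERVAL_MS = 10000L;   // illustrative value

          private final Map<String, Set<String>> failedServersByRegion =
              new HashMap<String, Set<String>>();
          private final Map<String, Long> lastAttemptByRegion = new HashMap<String, Long>();

          /** Remember that this region failed to open on this server. */
          synchronized void recordFailedOpen(String region, String server) {
            Set<String> servers = failedServersByRegion.get(region);
            if (servers == null) {
              servers = new HashSet<String>();
              failedServersByRegion.put(region, servers);
            }
            servers.add(server);
          }

          /** A candidate server is skipped if this region already failed to open there. */
          synchronized boolean shouldSkipServer(String region, String server) {
            Set<String> servers = failedServersByRegion.get(region);
            return servers != null && servers.contains(server);
          }

          /** Per-region rate limiting: refuse a new assign attempt if the last one was too recent. */
          synchronized boolean mayAttemptAssign(String region, long now) {
            Long last = lastAttemptByRegion.get(region);
            if (last != null && now - last.longValue() < MIN_RETRY_INTERVAL_MS) {
              return false;
            }
            lastAttemptByRegion.put(region, Long.valueOf(now));
            return true;
          }
        }
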

        As for the warning... You are probably right. The warning might still be an indication of double assignments (i.e. the region was already OPEN as far as the AM was concerned, and yet it got another OPENED message from ZK).
        I think in 0.94 we should leave the warning in, in case we see more issues here in the future. In 0.96+ it's not an issue anyway.

        ramkrishna.s.vasudevan added a comment -

        My old colleagues and I had thought of removing the abort statement, but we did not do it because it was leading to other issues.
        The fix here could lead to double assignments, but as Jimmy said, considering it a lesser occurrence (I may not use 'rare' here), it is ok to go with this.

        Lars Hofhansl added a comment -

        Could you outline a scenario where we'd have double assignment? The patch should replace the abort with a no-op, so double assignment still should not happen.
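
        To make the "replace the abort with a no-op" point concrete, a hedged sketch (simplified states and invented names, not the literal HBASE-8912 patch): when the region is not CLOSED/OFFLINE, the transition to OFFLINE is refused and that assign attempt is skipped instead of aborting the master.

        // Hedged sketch -- not the literal patch; simplified model of the OFFLINE transition check.
        class OfflineTransitionSketch {
          enum State { OFFLINE, PENDING_OPEN, OPENING, OPEN, CLOSED }

          /** True if it is safe to force the region OFFLINE and proceed with the assign. */
          static boolean canForceOffline(State current) {
            if (current == State.CLOSED || current == State.OFFLINE) {
              return true;                 // normal case: proceed with the assignment
            }
            // The old code aborted the master here with an IllegalStateException.
            // Treating it as a no-op leaves the in-flight open (e.g. PENDING_OPEN) alone
            // and simply skips this assign attempt.
            System.err.println("Region is " + current + ", not CLOSED/OFFLINE; skipping assign.");
            return false;
          }
        }
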


          People

          • Assignee:
            Lars Hofhansl
            Reporter:
            Enis Soztutar
          • Votes:
            4
            Watchers:
            17
