HBASE-7034

Bad version, failed OPENING to OPENED but master thinks it is open anyways

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.94.2
    • Fix Version/s: 0.94.5, 0.95.0
    • Component/s: Region Assignment
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      I have this in RS log:

      2012-10-22 02:21:50,698 ERROR org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed transitioning node b9,\xEE\xAE\x9BiQO\x89]+a\xE0\x7F\xB7'X?,1349052737638.9af7cfc9b15910a0b3d714bf40a3248f. from OPENING to OPENED -- closing region
      org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/unassigned/9af7cfc9b15910a0b3d714bf40a3248f
      

      Master says this (it is bulk assigning):

      ....
      2012-10-22 02:21:40,673 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:10302-0xb3a862e57a503ba Set watcher on existing znode /hbase/unassigned/9af7cfc9b15910a0b3d714bf40a3248f
      ...
      
      then this
      ....
      
      2012-10-22 02:23:47,089 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:10302-0xb3a862e57a503ba Set watcher on existing znode /hbase/unassigned/9af7cfc9b15910a0b3d714bf40a3248f
      ....
      2012-10-22 02:24:34,176 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:10302-0xb3a862e57a503ba Retrieved 112 byte(s) of data from znode /hbase/unassigned/9af7cfc9b15910a0b3d714bf40a3248f and set watcher; region=b9,\xEE\xAE\x9BiQO\x89]+a\xE0\x7F\xB7'X?,1349052737638.9af7cfc9b15910a0b3d714bf40a3248f., origin=sv4r17s44,10304,1350872216778, state=RS_ZK_REGION_OPENED
      
      etc.
      

There is disagreement between the master and the regionserver as to what is going on here.

    Attachments

  1. TestRecoverableZooKeeper.java
        3 kB
        Anoop Sam John
      2. HBASE-7034_Test_Trunk.patch
        5 kB
        Anoop Sam John
      3. HBASE-7034_94.patch
        1 kB
        Anoop Sam John
      4. HBASE-7034_94_V2.patch
        6 kB
        Anoop Sam John

        Activity

        Hudson added a comment -

        Integrated in HBase-0.94-security-on-Hadoop-23 #11 (See https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/11/)
        HBASE-7034 Bad version, failed OPENING to OPENED but master thinks it is open anyways (Anoop) (Revision 1434797)

        Result = FAILURE
        tedyu :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/zookeeper/RecoverableZooKeeper.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/zookeeper/TestRecoverableZooKeeper.java
        Hudson added a comment -

        Integrated in HBase-0.94-security #96 (See https://builds.apache.org/job/HBase-0.94-security/96/)
        HBASE-7034 Bad version, failed OPENING to OPENED but master thinks it is open anyways (Anoop) (Revision 1434797)

        Result = FAILURE
        tedyu :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/zookeeper/RecoverableZooKeeper.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/zookeeper/TestRecoverableZooKeeper.java
        ramkrishna.s.vasudevan added a comment -

        HBASE-6858 has the fix for trunk.

        Hudson added a comment -

        Integrated in HBase-0.94 #743 (See https://builds.apache.org/job/HBase-0.94/743/)
        HBASE-7034 Bad version, failed OPENING to OPENED but master thinks it is open anyways (Anoop) (Revision 1434797)

        Result = FAILURE
        tedyu :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/zookeeper/RecoverableZooKeeper.java
        • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/zookeeper/TestRecoverableZooKeeper.java
        Hudson added a comment -

        Integrated in HBase-TRUNK #3762 (See https://builds.apache.org/job/HBase-TRUNK/3762/)
        HBASE-7034 Bad version, failed OPENING to OPENED but master thinks it is open anyways (Anoop) (Revision 1434800)

        Result = FAILURE
        tedyu :
        Files :

        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/zookeeper/TestRecoverableZooKeeper.java
        Hudson added a comment -

        Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #355 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/355/)
        HBASE-7034 Bad version, failed OPENING to OPENED but master thinks it is open anyways (Anoop) (Revision 1434800)

        Result = FAILURE
        tedyu :
        Files :

        • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/zookeeper/TestRecoverableZooKeeper.java
        Ted Yu added a comment -

        Ran the new test and it passed in both 0.94 and trunk.

        Integrated to 0.94 and trunk.

        Thanks for the patch, Anoop.

        Thanks for the review, Ram.

        Anoop Sam John added a comment -

        "It would be nice to have a trunk patch which contains the new test."

        Let me make that then, Ted.
        I hadn't seen your comment; I was trying to apply the fix in trunk.

        Anoop Sam John added a comment -

        Oh... now that I have looked at the trunk code to apply the fix, it is already fixed there. How did we miss 0.94 then?!

        Ted Yu added a comment -

        +1 on patch v2.

        Patch v2 aligns with trunk code. It would be nice to have a trunk patch which contains the new test.

        Anoop Sam John added a comment -

        Patch for 0.94. Added the test case as well. The data check Ram suggested is also added.

        Anoop Sam John added a comment -

        "I feel that we should check for the data part whether it was set correctly. The id part is any way going to be the same i feel."

        Yes Ram, I am thinking the same. This was not the final patch anyway; I was planning to do that in the final patch and posted this one for feedback from you guys. As I read the code, the intent was to check this process's id against the id in the znode data, to detect whether another process had already changed the data while this process was trying to change it. In such a case this process gives up the operation. That is the whole reason the id is added to the data in the first place.

        The old check was wrong in any case: it compared the id against the data.

        Yes Ted, I can add the test case in the final patch.

        ramkrishna.s.vasudevan added a comment -

        @Anoop
        I feel that we should check the data part too, to see whether it was set correctly. The id part is anyway going to be the same, I feel.

        "we need to compare the id read from the zookeeper and the id for this process (this.id)"

        I am just referring to the above comment. I may be wrong; I know you would have studied this sufficiently.

        Ted Yu added a comment -

        Patch looks good - TestRecoverableZooKeeper passes with patch and fails without patch.

        Can you add TestRecoverableZooKeeper to the patch as well?
        Please add a license header and test category to TestRecoverableZooKeeper.

        Anoop Sam John added a comment -

        Patch for the 0.94 branch. Please review.

        Anoop Sam John added a comment -

        Unit test case which clearly reproduces the issue with retry. Ideally the test should not throw any exception: the first call to setData actually sets the data and then throws ConnectionLossException; the subsequent setData in the retry loop will get BadVersionException from the ZooKeeper layer, but RecoverableZooKeeper should catch it and treat the operation as a success.
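        The scenario this test exercises can be simulated without a real ZooKeeper. The sketch below is a hypothetical model, not the actual test or HBase code: FakeZk, ConnectionLoss, BadVersion, and recoverableSetData are invented stand-ins for ZooKeeper, KeeperException subclasses, and RecoverableZooKeeper.setData.

```java
import java.util.Arrays;

public class RetrySketch {
    static class ConnectionLoss extends Exception {}
    static class BadVersion extends Exception {}

    // Minimal stand-in for a ZooKeeper znode: the first setData applies
    // server-side but the client only sees a connection loss.
    static class FakeZk {
        byte[] data;
        int version = 0;
        boolean firstCall = true;

        void setData(byte[] newData, int expectedVersion)
                throws ConnectionLoss, BadVersion {
            if (firstCall) {
                firstCall = false;
                data = newData;             // the write actually lands...
                version++;
                throw new ConnectionLoss(); // ...but the client never hears back
            }
            if (expectedVersion != version) {
                throw new BadVersion();     // retry sees a stale version
            }
            data = newData;
            version++;
        }

        byte[] getData() { return data; }
    }

    // Recoverable setData: retry on ConnectionLoss; on BadVersion, check
    // whether our own earlier, unacknowledged write already succeeded.
    static boolean recoverableSetData(FakeZk zk, byte[] id, byte[] payload,
                                      int version) {
        byte[] tagged = tag(id, payload);
        for (int attempt = 0; attempt < 3; attempt++) {
            try {
                zk.setData(tagged, version);
                return true;
            } catch (ConnectionLoss e) {
                // transient: fall through and retry
            } catch (BadVersion e) {
                // success iff the data now in the znode is our own write
                return Arrays.equals(zk.getData(), tagged);
            }
        }
        return false;
    }

    static byte[] tag(byte[] id, byte[] payload) {
        byte[] out = new byte[id.length + payload.length];
        System.arraycopy(id, 0, out, 0, id.length);
        System.arraycopy(payload, 0, out, id.length, payload.length);
        return out;
    }

    public static void main(String[] args) {
        FakeZk zk = new FakeZk();
        boolean ok = recoverableSetData(zk, "rs-1:".getBytes(),
                                        "OPENED".getBytes(), 0);
        System.out.println(ok); // true: the retry recognises its own write
    }
}
```

        With the buggy comparison described later in this thread, the BadVersion path would not reliably recognise the process's own write, and the regionserver would abort the OPENING-to-OPENED transition even though the data had landed.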

        Anoop Sam John added a comment -

        Stack - I will try with some sort of unit test. I believe this caused the issue.

        stack added a comment -

        Anoop Sam John I don't know (I would swear we were in here recently looking at this exact area of the code..)

        Anoop Sam John added a comment -

        Yep the same. What does this compare?

        stack added a comment -

        Anoop Sam John You mean this bit, Anoop?

         if (Bytes.compareTo(revData, ID_LENGTH_SIZE, id.length,
             revData, dataOffset, dataLength) == 0) {
        
        Anoop Sam John added a comment -

        Will submit patch .

        Anoop Sam John added a comment -

        Did this code come in by mistake?

        RecoverableZooKeeper#setData(String path, byte[] data, int version) {
        ....
          byte[] revData = zk.getData(path, false, stat);
          int idLength = Bytes.toInt(revData, ID_LENGTH_SIZE);
          int dataLength = revData.length - ID_LENGTH_SIZE - idLength;
          int dataOffset = ID_LENGTH_SIZE + idLength;

          if (Bytes.compareTo(revData, ID_LENGTH_SIZE, id.length,
              revData, dataOffset, dataLength) == 0) {
            // the bad version is caused by previous successful setData
            return stat;
          }
        }

        When we write data to ZooKeeper, we write an identifier for the process into the data. Here, in order to check whether the BADVERSION exception from ZooKeeper is due to a previous setData from the same process, we need to compare the id read from ZooKeeper with the id of this process (this.id). Or am I missing something? The offset/length arithmetic and the comparison above look problematic to me: they compare the id portion of revData against the data portion of the same revData.

        If so, I guess this is the cause of this bug.

        From the log it is clear that there was no problem with the node and version at first. [As part of the transition from OPENING to OPENED, the present data is read first, and the check confirms the data and its version are fine.] Immediately afterwards a connection loss happened, which triggered a retry of the setData. The previous attempt may already have changed the data in ZooKeeper, and the master got the data-changed event.

        I think correcting the above code may solve the problem.
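        The corrected comparison described here can be sketched as follows. This is an illustrative simplification, not the actual RecoverableZooKeeper code: the znode layout (a 4-byte id length followed by the id and the payload) and the helper names are assumptions for the sketch.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class SetDataIdCheck {
    // Assumed, simplified znode layout: [4-byte id length][process id][payload].
    static byte[] appendMetaData(byte[] id, byte[] payload) {
        return ByteBuffer.allocate(4 + id.length + payload.length)
                .putInt(id.length).put(id).put(payload).array();
    }

    // The corrected check: compare the id embedded in the data read back
    // from ZooKeeper against this process's id (this.id in the real code),
    // instead of comparing one slice of revData against another slice of
    // the same revData.
    static boolean writtenByThisProcess(byte[] revData, byte[] thisId) {
        if (revData == null || revData.length < 4) {
            return false;
        }
        int idLength = ByteBuffer.wrap(revData).getInt();
        if (idLength != thisId.length || revData.length < 4 + idLength) {
            return false;
        }
        byte[] storedId = Arrays.copyOfRange(revData, 4, 4 + idLength);
        return Arrays.equals(storedId, thisId);
    }

    public static void main(String[] args) {
        byte[] myId = "sv4r17s44,10304".getBytes();
        byte[] znode = appendMetaData(myId, "RS_ZK_REGION_OPENED".getBytes());
        System.out.println(writtenByThisProcess(znode, myId));               // true
        System.out.println(writtenByThisProcess(znode, "other".getBytes())); // false
    }
}
```

        On a BadVersionException, a check like this lets the retrying process distinguish "my earlier write already landed" (return success) from "another process moved the znode forward" (give up the operation).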

        stack added a comment -

        Here is more from the log, Ram.

        Master side listing of all lines related to 1349052737638.9af7cfc9b15910a0b3d714bf40a3248f

        2012-10-22 02:20:01,351 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:10302-0xb3a862e57a503ba Async create of unassigned node for 9af7cfc9b15910a0b3d714bf40a3248f with OFFLINE state
        2012-10-22 02:20:24,577 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:10302-0xb3a862e57a503ba Set watcher on existing znode /hbase/unassigned/9af7cfc9b15910a0b3d714bf40a3248f
        2012-10-22 02:21:08,770 DEBUG org.apache.hadoop.hbase.master.AssignmentManager$CreateUnassignedAsyncCallback: rs=b9,\xEE\xAE\x9BiQO\x89]+a\xE0\x7F\xB7'X?,1349052737638.9af7cfc9b15910a0b3d714bf40a3248f. state=OFFLINE, ts=1350872401351, server=null, server=sv4r17s44,10304,1350872216778
        2012-10-22 02:21:09,495 DEBUG org.apache.hadoop.hbase.master.AssignmentManager$ExistsUnassignedAsyncCallback: rs=b9,\xEE\xAE\x9BiQO\x89]+a\xE0\x7F\xB7'X?,1349052737638.9af7cfc9b15910a0b3d714bf40a3248f. state=OFFLINE, ts=1350872401351, server=null
        2012-10-22 02:21:40,673 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:10302-0xb3a862e57a503ba Set watcher on existing znode /hbase/unassigned/9af7cfc9b15910a0b3d714bf40a3248f
        2012-10-22 02:23:47,089 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:10302-0xb3a862e57a503ba Set watcher on existing znode /hbase/unassigned/9af7cfc9b15910a0b3d714bf40a3248f
        2012-10-22 02:24:34,154 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: master:10302-0xb3a862e57a503ba Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, path=/hbase/unassigned/9af7cfc9b15910a0b3d714bf40a3248f
        2012-10-22 02:24:34,176 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:10302-0xb3a862e57a503ba Retrieved 112 byte(s) of data from znode /hbase/unassigned/9af7cfc9b15910a0b3d714bf40a3248f and set watcher; region=b9,\xEE\xAE\x9BiQO\x89]+a\xE0\x7F\xB7'X?,1349052737638.9af7cfc9b15910a0b3d714bf40a3248f., origin=sv4r17s44,10304,1350872216778, state=RS_ZK_REGION_OPENED
        2012-10-22 02:24:34,176 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=sv4r17s44,10304,1350872216778, region=9af7cfc9b15910a0b3d714bf40a3248f, which is more than 15 seconds late
        2012-10-22 02:24:34,176 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for b9,\xEE\xAE\x9BiQO\x89]+a\xE0\x7F\xB7'X?,1349052737638.9af7cfc9b15910a0b3d714bf40a3248f. from sv4r17s44,10304,1350872216778; deleting unassigned node
        2012-10-22 02:24:34,176 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:10302-0xb3a862e57a503ba Deleting existing unassigned node for 9af7cfc9b15910a0b3d714bf40a3248f that is in expected state RS_ZK_REGION_OPENED
        2012-10-22 02:24:34,221 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:10302-0xb3a862e57a503ba Retrieved 112 byte(s) of data from znode /hbase/unassigned/9af7cfc9b15910a0b3d714bf40a3248f; data=region=b9,\xEE\xAE\x9BiQO\x89]+a\xE0\x7F\xB7'X?,1349052737638.9af7cfc9b15910a0b3d714bf40a3248f., origin=sv4r17s44,10304,1350872216778, state=RS_ZK_REGION_OPENED
        2012-10-22 02:24:34,239 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:10302-0xb3a862e57a503ba Successfully deleted unassigned node for region 9af7cfc9b15910a0b3d714bf40a3248f in expected state RS_ZK_REGION_OPENED
        2012-10-22 02:27:09,169 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: master:10302-0xb3a862e57a503ba Received ZooKeeper Event, type=NodeDeleted, state=SyncConnected, path=/hbase/unassigned/9af7cfc9b15910a0b3d714bf40a3248f
        2012-10-22 02:27:09,169 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region b9,\xEE\xAE\x9BiQO\x89]+a\xE0\x7F\xB7'X?,1349052737638.9af7cfc9b15910a0b3d714bf40a3248f. has been deleted.
        2012-10-22 02:27:09,174 INFO org.apache.hadoop.hbase.master.AssignmentManager: The master has opened the region b9,\xEE\xAE\x9BiQO\x89]+a\xE0\x7F\xB7'X?,1349052737638.9af7cfc9b15910a0b3d714bf40a3248f. that was online on sv4r17s44,10304,1350872216778
        
        

        The above is being processed some minutes after the regionserver side (see below), I guess because it is a bulk assign and there is a lot of other ZK traffic going on at this time.

        Here is the regionserver side.

        2012-10-22 02:21:34,778 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:10304-0xd3a863aa2ee011a Attempting to transition node 9af7cfc9b15910a0b3d714bf40a3248f from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENED
        2012-10-22 02:21:34,801 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: regionserver:10304-0xd3a863aa2ee011a Retrieved 112 byte(s) of data from znode /hbase/unassigned/9af7cfc9b15910a0b3d714bf40a3248f; data=region=b9,\xEE\xAE\x9BiQO\x89]+a\xE0\x7F\xB7'X?,1349052737638.9af7cfc9b15910a0b3d714bf40a3248f., origin=sv4r17s44,10304,1350872216778, state=RS_ZK_REGION_OPE
        NING
        2012-10-22 02:21:48,240 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/unassigned/9af7cfc9b15910a0b3d714bf40a3248f
        2012-10-22 02:21:50,698 ERROR org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed transitioning node b9,\xEE\xAE\x9BiQO\x89]+a\xE0\x7F\xB7'X?,1349052737638.9af7cfc9b15910a0b3d714bf40a3248f. from OPENING to OPENED -- closing region
        org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/unassigned/9af7cfc9b15910a0b3d714bf40a3248f
        2012-10-22 02:21:50,698 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Closing b9,\xEE\xAE\x9BiQO\x89]+a\xE0\x7F\xB7'X?,1349052737638.9af7cfc9b15910a0b3d714bf40a3248f.: disabling compactions & flushes
        2012-10-22 02:21:50,698 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for region b9,\xEE\xAE\x9BiQO\x89]+a\xE0\x7F\xB7'X?,1349052737638.9af7cfc9b15910a0b3d714bf40a3248f.
        2012-10-22 02:21:50,703 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed b9,\xEE\xAE\x9BiQO\x89]+a\xE0\x7F\xB7'X?,1349052737638.9af7cfc9b15910a0b3d714bf40a3248f.
        2012-10-22 02:43:35,980 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: b9,\xEE\xAE\x9BiQO\x89]+a\xE0\x7F\xB7'X?,1349052737638.9af7cfc9b15910a0b3d714bf40a3248f.
        2012-10-22 02:43:36,392 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: b9,\xEE\xAE\x9BiQO\x89]+a\xE0\x7F\xB7'X?,1349052737638.9af7cfc9b15910a0b3d714bf40a3248f.
        2012-10-22 02:43:37,012 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: b9,\xEE\xAE\x9BiQO\x89]+a\xE0\x7F\xB7'X?,1349052737638.9af7cfc9b15910a0b3d714bf40a3248f.
        

        I think the client keeps coming here because the region was here previously, before the restart, so that is the location recorded in the .META. table.

        ramkrishna.s.vasudevan added a comment -

        @Stack
        Does this happen on 0.94.2? Which RS failed in transitioning the node to OPENED: a new one, or the one which had already transitioned it to OPENED?
        Sorry for the questions; without the full logs I had these doubts. Thanks, Stack, for reporting this; it should be useful for us also.


          People

          • Assignee:
            Anoop Sam John
            Reporter:
            stack
          • Votes:
            0
          • Watchers:
            8
