Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.90.3
    • Fix Version/s: 0.90.4
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      In TimeoutMonitor :
      if node exists and node state is RS_ZK_REGION_CLOSED
      We should send a zk message again when close region is timeout.
      in this case, It may be loss some message.

      I See. It seems like a bug. This is my analysis.

      // disable table and master sent Close message to region server, Region state was set PENDING_CLOSE

      2011-05-08 17:44:25,745 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175) for region ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.
      2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467
      2011-05-08 17:45:45,542 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467

      // received splitting message and cleared Region state (PENDING_CLOSE)

      2011-05-08 17:46:45,303 WARN org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 4418fb197685a21f77e151e401cf8b66 on serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175)
      2011-05-08 17:46:45,538 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467
      2011-05-08 17:47:45,548 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467
      2011-05-08 17:48:45,545 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467
      2011-05-08 17:49:46,108 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467
      2011-05-08 17:50:46,105 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467
      2011-05-08 17:51:46,117 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467
      2011-05-08 17:52:46,112 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467
      2011-05-08 17:52:47,309 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x22fcd582836003d Retrieved 125 byte(s) of data from znode /hbase/unassigned/4418fb197685a21f77e151e401cf8b66 and set watcher; region=ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66., server=C4C4.site,60020,1304820199467, state=RS_ZK_REGION_CLOSED
      2011-05-08 17:52:47,388 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/4418fb197685a21f77e151e401cf8b66 (region=ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66., server=C4C4.site,60020,1304820199467, state=RS_ZK_REGION_CLOSED)
      2011-05-08 17:52:47,388 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=C4C4.site,60020,1304820199467, region=4418fb197685a21f77e151e401cf8b66

      // region server had closed region, but the region state had cleared. So it printed warning log.

      2011-05-08 17:52:47,388 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 4418fb197685a21f77e151e401cf8b66 from server C4C4.site,60020,1304820199467 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states
      2011-05-08 17:52:47,397 WARN org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 4418fb197685a21f77e151e401cf8b66 on serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175)

      // The region state was set PENDING_CLOSE again. the table couldn't disable and enable.
      2011-05-08 17:52:47,398 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. (offlining)

      1. logs.rar
        9 kB
        gaojinchao
      2. AssignmentManager_90v4.patch
        2 kB
        gaojinchao
      3. AssignmentManager_90v3.patch
        2 kB
        gaojinchao

        Activity

        Hide
        stack added a comment -

        @Gao Thanks for the info.

        Show
        stack added a comment - @Gao Thanks for the info.
        Hide
        gaojinchao added a comment -

        TRUNK don't need. It has modified for zk watcher.

        Below code should protect this case.
        // RegionState must be null, or SPLITTING or PENDING_CLOSE.
        if (!isInStateForSplitting(regionState)) break;

        Show
        gaojinchao added a comment - TRUNK don't need. It has modified for zk watcher. Below code should protect this case. // RegionState must be null, or SPLITTING or PENDING_CLOSE. if (!isInStateForSplitting(regionState)) break;
        Hide
        stack added a comment -

        Applied to branch. I talked to J-D and we need to be able to handle double-split messages however they arise.

        This patch won't work for TRUNK. Its changed too much. Will need something like this in TRUNK.

        Show
        stack added a comment - Applied to branch. I talked to J-D and we need to be able to handle double-split messages however they arise. This patch won't work for TRUNK. Its changed too much. Will need something like this in TRUNK.
        Hide
        gaojinchao added a comment -

        It didn't reproduce. So, My guess J-D is right. Below logs shows that:

        Region server repeated message an interval of 60s. So It should be IPC timeout.
        2011-05-08 17:43:45,507 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000
        2011-05-08 17:44:45,521 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000
        2011-05-08 17:45:45,524 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000
        2011-05-08 17:46:45,528 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000

        Hmaster logs shows regionServerReport IPC had been closed. It also proved IPC timeout

        2011-05-08 17:52:47,703 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call regionServerReport(serverName=C4C3.site,60020,1304820199474, load=(requests=0, regions=55, usedHeap=1058, maxHeap=8175), [Lorg.apache.hadoop.hbase.HMsg;@1453ecec, [Lorg.apache.hadoop.hbase.HRegionInfo;@11e78461) from 157.5.100.3:37518: output error
        2011-05-08 17:52:47,704 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 7 on 60000 caught: java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
        at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)

        But I can't dig the root reason.

        Show
        gaojinchao added a comment - It didn't reproduce. So, My guess J-D is right. Below logs shows that: Region server repeated message an interval of 60s. So It should be IPC timeout. 2011-05-08 17:43:45,507 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000 2011-05-08 17:44:45,521 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000 2011-05-08 17:45:45,524 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000 2011-05-08 17:46:45,528 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000 Hmaster logs shows regionServerReport IPC had been closed. It also proved IPC timeout 2011-05-08 17:52:47,703 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call regionServerReport(serverName=C4C3.site,60020,1304820199474, load=(requests=0, regions=55, usedHeap=1058, maxHeap=8175), [Lorg.apache.hadoop.hbase.HMsg;@1453ecec, [Lorg.apache.hadoop.hbase.HRegionInfo;@11e78461) from 157.5.100.3:37518: output error 2011-05-08 17:52:47,704 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 7 on 60000 caught: java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324) at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341) at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727) at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083) But I can't dig the root reason.
        Hide
        stack added a comment -

        Gao:

        So, help me out:

        +    synchronized (this.regions) {         
        +      //one daughter is already online, do nothing
        +      HServerInfo hsia = this.regions.get(a);
        +      if (hsia != null){
        +        LOG.warn("Trying to process the split of " +a.getEncodedName()+ ", " +
        +          "but it was already done and one daughter is on region server " + hsia);
        +        return;
        +      }
        +    }
        

        How would the daughter region be on line already? We are sending the split message multiple times? I see it above in a comment but am not clear why this would happen.

        Is it because master was not letting messages in? The regionserver failed to deliver its report so it retried? But in actuality the report HAD been delivered and it is just a case that the regionserver didn't wait long enough?

        Patch looks good. Just want to be clear on what we think its fixing (As J-D said, do you have some log snippet from regionserver when it fails to deliver the report? Is there a timeout in there?).

        Show
        stack added a comment - Gao: So, help me out: + synchronized ( this .regions) { + //one daughter is already online, do nothing + HServerInfo hsia = this .regions.get(a); + if (hsia != null ){ + LOG.warn( "Trying to process the split of " +a.getEncodedName()+ ", " + + "but it was already done and one daughter is on region server " + hsia); + return ; + } + } How would the daughter region be on line already? We are sending the split message multiple times? I see it above in a comment but am not clear why this would happen. Is it because master was not letting messages in? The regionserver failed to deliver its report so it retried? But in actuality the report HAD been delivered and it is just a case that the regionserver didn't wait long enough? Patch looks good. Just want to be clear on what we think its fixing (As J-D said, do you have some log snippet from regionserver when it fails to deliver the report? Is there a timeout in there?).
        Hide
        gaojinchao added a comment -

        Hi J-D,
        I have created a unit test and verified it. Please review it and give me some suggesion.

        Thanks.

        Show
        gaojinchao added a comment - Hi J-D, I have created a unit test and verified it. Please review it and give me some suggesion. Thanks.
        Hide
        gaojinchao added a comment -

        No, It need review and merge.

        Show
        gaojinchao added a comment - No, It need review and merge.
        Hide
        kristam subba swathi added a comment -

        is this got fixed in hbase 0.90.4?

        Show
        kristam subba swathi added a comment - is this got fixed in hbase 0.90.4?
        Hide
        gaojinchao added a comment -

        I am not familiar with unit test(It seems diffult to send a double report of a split and test cluster function). So
        I verified it by modified region server code.

        Logs:
        2011-06-02 19:57:49,056 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr2,8613816508215#1308,1307014802589.c5a8a47d9c84f417b9fcc4c8019e7c7e.: Daughters; ufdr2,8613816508215#1308,1307015867020.37481173e31ea469bcaa310cf8d7d980., ufdr2,8613816595415#3432,1307015867020.afbf02ef235cabe66026f7c393d79bc0. from C4C4.site,60020,1307015130114
        2011-06-02 19:57:49,057 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x30502c278f0000 Unable to get data of znode /hbase/unassigned/c5a8a47d9c84f417b9fcc4c8019e7c7e because node does not exist (not necessarily an error)
        2011-06-02 19:57:49,081 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr2,8613816508215#1308,1307014802589.c5a8a47d9c84f417b9fcc4c8019e7c7e.: Daughters; ufdr2,8613816508215#1308,1307015867020.37481173e31ea469bcaa310cf8d7d980., ufdr2,8613816595415#3432,1307015867020.afbf02ef235cabe66026f7c393d79bc0. from C4C4.site,60020,1307015130114
        2011-06-02 19:57:49,083 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x30502c278f0000 Unable to get data of znode /hbase/unassigned/c5a8a47d9c84f417b9fcc4c8019e7c7e because node does not exist (not necessarily an error)
        2011-06-02 19:57:49,083 WARN org.apache.hadoop.hbase.master.AssignmentManager: Trying to process the split of 37481173e31ea469bcaa310cf8d7d980, but it was already done and one daughter is on region server serverName=C4C4.site,60020,1307015130114, load=(requests=0, regions=0, usedHeap=0, maxHeap=0)
        2011-06-02 19:57:56,468 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr2,8613819021840#1446,1307014756068.f865d41d918297f30b576b9ea3ccea07.: Daughters; ufdr2,8613819021840#1446,1307015873554.baa21e4f0cfa5840f009d0fac8e83d15., ufdr2,8613819104397#3916,1307015873554.fb63f608e5e37f5e85d71c925bc78010. from C4C3.site,60020,1307015129703
        2011-06-02 19:57:56,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x30502c278f0000 Unable to get data of znode /hbase/unassigned/f865d41d918297f30b576b9ea3ccea07 because node does not exist (not necessarily an error)
        2011-06-02 19:57:56,472 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr2,8613819021840#1446,1307014756068.f865d41d918297f30b576b9ea3ccea07.: Daughters; ufdr2,8613819021840#1446,1307015873554.baa21e4f0cfa5840f009d0fac8e83d15., ufdr2,8613819104397#3916,1307015873554.fb63f608e5e37f5e85d71c925bc78010. from C4C3.site,60020,1307015129703
        2011-06-02 19:57:56,474 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x30502c278f0000 Unable to get data of znode /hbase/unassigned/f865d41d918297f30b576b9ea3ccea07 because node does not exist (not necessarily an error)
        2011-06-02 19:57:56,474 WARN org.apache.hadoop.hbase.master.AssignmentManager: Trying to process the split of baa21e4f0cfa5840f009d0fac8e83d15, but it was already done and one daughter is on region server serverName=C4C3.site,60020,1307015129703, load=(requests=0, regions=0, usedHeap=0, maxHeap=0)

        Thanks for your hint. It should be a 60 seconds timeout. Region server repeated message about 60s.

        2011-05-08 17:43:45,507 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000
        2011-05-08 17:44:45,521 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000

        It seems a race with regionsInTransition. So IPC was blocked. I try to reproduce it.

        Hmaster logs: It recieved so much message RS_ZK_REGION_CLOSED.
        2011-05-08 17:43:45,157 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x22fcd582836003d Retrieved 125 byte(s) of data from znode /hbase/unassigned/83c05d9ead23d9a260edf30dc8739cf7 and set watcher; region=ufdr,2011050802#8613815394007#0610,1304847545412.83c05d9ead23d9a260edf30dc8739cf7., server=C4C4.site,60020,1304820199467, state=RS_ZK_REGION_CLOSING
        2011-05-08 17:43:45,525 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467
        2011-05-08 17:43:48,943 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x22fcd582836003d Retrieved 125 byte(s) of data from znode /hbase/unassigned/5e3bacf3f43b6bad874e80c2f971e632 and set watcher; region=ufdr,2011050803#8613819042855#3419,1304844562298.5e3bacf3f43b6bad874e80c2f971e632., server=C4C5.site,60020,1304820199469, state=RS_ZK_REGION_CLOSED
        2011-05-08 17:43:50,777 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x22fcd582836003d Retrieved 125 byte(s) of data from znode /hbase/unassigned/aaddc1ecd4ab768357b3a44df6828e23 and set watcher; region=ufdr,2011050810#8613815306434#1853,1304823564447.aaddc1ecd4ab768357b3a44df6828e23., server=C4C5.site,60020,1304820199469, state=RS_ZK_REGION_CLOSED
        2011-05-08 17:43:53,219 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x22fcd582836003d Retrieved 125 byte(s) of data from znode /hbase/unassigned/6bf4788ea7349a62967b6781b37cf157 and set watcher; region=ufdr,2011050802#8613816510664#5948,1304842169086.6bf4788ea7349a62967b6781b37cf157., server=C4C4.site,60020,1304820199467, state=RS_ZK_REGION_CLOSED
        2011-05-08 17:43:55,812 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x22fcd582836003d Retrieved 125 byte(s) of data from znode /hbase/unassigned/3decca49c2dbc3f8a9d49f192b747b12 and set watcher; region=ufdr,2011050801#8613819767209#4733,1304837619424.3decca49c2dbc3f8a9d49f192b747b12., server=C4C3.site,60020,1304820199474, state=RS_ZK_REGION_CLOSED
        2011-05-08 17:43:56,722 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x22fcd582836003d Retrieved 125 byte(s) of data from znode /hbase/unassigned/49cec92a2e7dd6d8ef38aab6128b29f8 and set watcher; region=ufdr,2011050802#8613815474662#2305,1304847545412.49cec92a2e7dd6d8ef38aab6128b29f8., server=C4C4.site,60020,1304820199467, state=RS_ZK_REGION_CLOSED
        2011-05-08 17:44:01,382 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x22fcd582836003d Retrieved 125 byte(s) of data from znode /hbase/unassigned/fcaaff18b3afa53e2bae3c9eb7b3aca2 and set watcher; region=ufdr,2011050803#8613819212381#4006,1304845030265.fcaaff18b3afa53e2bae3c9eb7b3aca2., server=C4C5.site,60020,1304820199469, state=RS_ZK_REGION_CLOSED
        2011-05-08 17:44:01,383 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x22fcd582836003d Retrieved 125 byte(s) of data from znode /hbase/unassigned/6fa30b1a2352e36e0b5c9272b61e17cf and set watcher; region=ufdr,2011050801#8613818694123#3028,1304847154793.6fa30b1a2352e36e0b5c9272b61e17cf., server=C4C4.site,60020,1304820199467, state=RS_ZK_REGION_CLOSED

        Show
        gaojinchao added a comment - I am not familiar with unit test(It seems diffult to send a double report of a split and test cluster function). So I verified it by modified region server code. Logs: 2011-06-02 19:57:49,056 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr2,8613816508215#1308,1307014802589.c5a8a47d9c84f417b9fcc4c8019e7c7e.: Daughters; ufdr2,8613816508215#1308,1307015867020.37481173e31ea469bcaa310cf8d7d980., ufdr2,8613816595415#3432,1307015867020.afbf02ef235cabe66026f7c393d79bc0. from C4C4.site,60020,1307015130114 2011-06-02 19:57:49,057 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x30502c278f0000 Unable to get data of znode /hbase/unassigned/c5a8a47d9c84f417b9fcc4c8019e7c7e because node does not exist (not necessarily an error) 2011-06-02 19:57:49,081 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr2,8613816508215#1308,1307014802589.c5a8a47d9c84f417b9fcc4c8019e7c7e.: Daughters; ufdr2,8613816508215#1308,1307015867020.37481173e31ea469bcaa310cf8d7d980., ufdr2,8613816595415#3432,1307015867020.afbf02ef235cabe66026f7c393d79bc0. from C4C4.site,60020,1307015130114 2011-06-02 19:57:49,083 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x30502c278f0000 Unable to get data of znode /hbase/unassigned/c5a8a47d9c84f417b9fcc4c8019e7c7e because node does not exist (not necessarily an error) 2011-06-02 19:57:49,083 WARN org.apache.hadoop.hbase.master.AssignmentManager: Trying to process the split of 37481173e31ea469bcaa310cf8d7d980, but it was already done and one daughter is on region server serverName=C4C4.site,60020,1307015130114, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) 2011-06-02 19:57:56,468 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr2,8613819021840#1446,1307014756068.f865d41d918297f30b576b9ea3ccea07.: Daughters; ufdr2,8613819021840#1446,1307015873554.baa21e4f0cfa5840f009d0fac8e83d15., ufdr2,8613819104397#3916,1307015873554.fb63f608e5e37f5e85d71c925bc78010. from C4C3.site,60020,1307015129703 2011-06-02 19:57:56,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x30502c278f0000 Unable to get data of znode /hbase/unassigned/f865d41d918297f30b576b9ea3ccea07 because node does not exist (not necessarily an error) 2011-06-02 19:57:56,472 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr2,8613819021840#1446,1307014756068.f865d41d918297f30b576b9ea3ccea07.: Daughters; ufdr2,8613819021840#1446,1307015873554.baa21e4f0cfa5840f009d0fac8e83d15., ufdr2,8613819104397#3916,1307015873554.fb63f608e5e37f5e85d71c925bc78010. from C4C3.site,60020,1307015129703 2011-06-02 19:57:56,474 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x30502c278f0000 Unable to get data of znode /hbase/unassigned/f865d41d918297f30b576b9ea3ccea07 because node does not exist (not necessarily an error) 2011-06-02 19:57:56,474 WARN org.apache.hadoop.hbase.master.AssignmentManager: Trying to process the split of baa21e4f0cfa5840f009d0fac8e83d15, but it was already done and one daughter is on region server serverName=C4C3.site,60020,1307015129703, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) Thanks for your hint. It should be a 60 seconds timeout. Region server repeated message about 60s. 2011-05-08 17:43:45,507 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000 2011-05-08 17:44:45,521 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000 It seems a race with regionsInTransition. So IPC was blocked. I try to reproduce it. Hmaster logs: It recieved so much message RS_ZK_REGION_CLOSED. 2011-05-08 17:43:45,157 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x22fcd582836003d Retrieved 125 byte(s) of data from znode /hbase/unassigned/83c05d9ead23d9a260edf30dc8739cf7 and set watcher; region=ufdr,2011050802#8613815394007#0610,1304847545412.83c05d9ead23d9a260edf30dc8739cf7., server=C4C4.site,60020,1304820199467, state=RS_ZK_REGION_CLOSING 2011-05-08 17:43:45,525 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467 2011-05-08 17:43:48,943 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x22fcd582836003d Retrieved 125 byte(s) of data from znode /hbase/unassigned/5e3bacf3f43b6bad874e80c2f971e632 and set watcher; region=ufdr,2011050803#8613819042855#3419,1304844562298.5e3bacf3f43b6bad874e80c2f971e632., server=C4C5.site,60020,1304820199469, state=RS_ZK_REGION_CLOSED 2011-05-08 17:43:50,777 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x22fcd582836003d Retrieved 125 byte(s) of data from znode /hbase/unassigned/aaddc1ecd4ab768357b3a44df6828e23 and set watcher; region=ufdr,2011050810#8613815306434#1853,1304823564447.aaddc1ecd4ab768357b3a44df6828e23., server=C4C5.site,60020,1304820199469, state=RS_ZK_REGION_CLOSED 2011-05-08 17:43:53,219 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x22fcd582836003d Retrieved 125 byte(s) of data from znode /hbase/unassigned/6bf4788ea7349a62967b6781b37cf157 and set watcher; region=ufdr,2011050802#8613816510664#5948,1304842169086.6bf4788ea7349a62967b6781b37cf157., server=C4C4.site,60020,1304820199467, state=RS_ZK_REGION_CLOSED 2011-05-08 17:43:55,812 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x22fcd582836003d Retrieved 125 byte(s) of data from znode /hbase/unassigned/3decca49c2dbc3f8a9d49f192b747b12 and set watcher; region=ufdr,2011050801#8613819767209#4733,1304837619424.3decca49c2dbc3f8a9d49f192b747b12., server=C4C3.site,60020,1304820199474, state=RS_ZK_REGION_CLOSED 2011-05-08 17:43:56,722 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x22fcd582836003d Retrieved 125 byte(s) of data from znode /hbase/unassigned/49cec92a2e7dd6d8ef38aab6128b29f8 and set watcher; region=ufdr,2011050802#8613815474662#2305,1304847545412.49cec92a2e7dd6d8ef38aab6128b29f8., server=C4C4.site,60020,1304820199467, state=RS_ZK_REGION_CLOSED 2011-05-08 17:44:01,382 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x22fcd582836003d Retrieved 125 byte(s) of data from znode /hbase/unassigned/fcaaff18b3afa53e2bae3c9eb7b3aca2 and set watcher; region=ufdr,2011050803#8613819212381#4006,1304845030265.fcaaff18b3afa53e2bae3c9eb7b3aca2., server=C4C5.site,60020,1304820199469, state=RS_ZK_REGION_CLOSED 2011-05-08 17:44:01,383 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x22fcd582836003d Retrieved 125 byte(s) of data from znode /hbase/unassigned/6fa30b1a2352e36e0b5c9272b61e17cf and set watcher; region=ufdr,2011050801#8613818694123#3028,1304847154793.6fa30b1a2352e36e0b5c9272b61e17cf., server=C4C4.site,60020,1304820199467, state=RS_ZK_REGION_CLOSED
        Hide
        Jean-Daniel Cryans added a comment -

        Why did regionServerReport throw exception ?

        My guess is that it was a 60 seconds timeout, unfortunately we don't print it. Could you add that to your patch?

        Also regarding the patch, could you create a unit test that sends a double report of a split and then verifies that the second one didn't screw up anything?

        Also fix this line:

        //if Daughters has online, do nothing

        to

        // if one daughter is already online, do nothing

        And then maybe add more context to the error message, like "Trying to process the split of region_foo, but it was already done and one daughter is on region server regionserver_bar".

        Show
        Jean-Daniel Cryans added a comment - Why did regionServerReport throw exception ? My guess is that it was a 60 seconds timeout, unfortunately we don't print it. Could you add that to your patch? Also regarding the patch, could you create a unit test that sends a double report of a split and then verifies that the second one didn't screw up anything? Also fix this line: //if Daughters has online, do nothing to // if one daughter is already online, do nothing And then maybe add more context to the error message, like "Trying to process the split of region_foo, but it was already done and one daughter is on region server regionserver_bar".
        Hide
        gaojinchao added a comment -

        patch(AssignmentManager_90v2) looks like benefit. Thanks.

        Show
        gaojinchao added a comment - patch(AssignmentManager_90v2) looks like benefit. Thanks.
        Hide
        gaojinchao added a comment -

        I keep digging, find the repeated message was sent by region server.

        If regionServerReport throwed exception, Region server will connect Hmaster again and send message again.

        //region server logs.
        2011-05-08 17:43:45,507 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000
        2011-05-08 17:44:45,521 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000
        2011-05-08 17:45:45,524 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000
        2011-05-08 17:46:45,528 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000
        2011-05-08 17:47:45,531 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000
        2011-05-08 17:48:45,535 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000
        2011-05-08 17:49:46,091 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000
        2011-05-08 17:50:46,096 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000
        2011-05-08 17:51:46,099 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000
        2011-05-08 17:52:46,104 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000

        //region server code.
        List<HMsg> tryRegionServerReport(final List<HMsg> outboundMessages)
        throws IOException {
        this.serverInfo.setLoad(buildServerLoad());
        this.requestCount.set(0);
        addOutboundMsgs(outboundMessages);
        HMsg [] msgs = null;
        while (!this.stopped) {
        try

        { msgs = this.hbaseMaster.regionServerReport(this.serverInfo, outboundMessages.toArray(HMsg.EMPTY_HMSG_ARRAY), getMostLoadedRegions()); break; }

        catch (IOException ioe) {
        if (ioe instanceof RemoteException)

        { ioe = ((RemoteException)ioe).unwrapRemoteException(); }

        if (ioe instanceof YouAreDeadException)

        { // This will be caught and handled as a fatal error in run() throw ioe; }

        // Couldn't connect to the master, get location from zk and reconnect
        // Method blocks until new master is found or we are stopped
        getMaster();
        }
        }

        Why did regionServerReport throw exception ?

        It seems Hmaster was busy and IPC blocked.

        Hmaster logs:
        2011-05-08 17:44:25,745 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call regionServerReport(serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175), [Lorg.apache.hadoop.hbase.HMsg;@520ed128, [Lorg.apache.hadoop.hbase.HRegionInfo;@4ac5c32e) from 157.5.100.4:50187: output error
        2011-05-08 17:44:25,745 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 11 on 60000 caught: java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
        at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)

        Show
        gaojinchao added a comment - I keep digging, find the repeated message was sent by region server. If regionServerReport throwed exception, Region server will connect Hmaster again and send message again. //region server logs. 2011-05-08 17:43:45,507 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000 2011-05-08 17:44:45,521 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000 2011-05-08 17:45:45,524 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000 2011-05-08 17:46:45,528 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000 2011-05-08 17:47:45,531 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000 2011-05-08 17:48:45,535 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000 2011-05-08 17:49:46,091 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000 2011-05-08 17:50:46,096 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000 2011-05-08 17:51:46,099 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000 2011-05-08 17:52:46,104 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected to master at C4C1.site:60000 //region server code. List<HMsg> tryRegionServerReport(final List<HMsg> outboundMessages) throws IOException { this.serverInfo.setLoad(buildServerLoad()); this.requestCount.set(0); addOutboundMsgs(outboundMessages); HMsg [] msgs = null; while (!this.stopped) { try { msgs = this.hbaseMaster.regionServerReport(this.serverInfo, outboundMessages.toArray(HMsg.EMPTY_HMSG_ARRAY), getMostLoadedRegions()); break; } catch (IOException ioe) { if (ioe instanceof RemoteException) { ioe = ((RemoteException)ioe).unwrapRemoteException(); } if (ioe instanceof YouAreDeadException) { // This will be caught and handled as a fatal error in run() throw ioe; } // Couldn't connect to the master, get location from zk and reconnect // Method blocks until new master is found or we are stopped getMaster(); } } Why did regionServerReport throw exception ? It seems Hmaster was busy and IPC blocked. Hmaster logs: 2011-05-08 17:44:25,745 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call regionServerReport(serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175), [Lorg.apache.hadoop.hbase.HMsg;@520ed128, [Lorg.apache.hadoop.hbase.HRegionInfo;@4ac5c32e) from 157.5.100.4:50187: output error 2011-05-08 17:44:25,745 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 11 on 60000 caught: java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324) at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341) at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727) at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
        Hide
        gaojinchao added a comment -

        Hi, Stack. I made a mistake above analysis.
        I read the code again, The root reason is spliting message repeated.

        2011-05-08 17:42:45,514 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467

        //Closed the region.
        2011-05-08 17:43:37,599 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. (offlining)

        2011-05-08 17:43:45,525 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467

        // set RIT state and sent a message
        2011-05-08 17:44:25,745 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175) for region ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66.

        //Received split message again and RIT was deleted . So it could not process closed event.

        2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467
        2011-05-08 17:45:45,542 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467

        //
        2011-05-08 17:46:45,303 WARN org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 4418fb197685a21f77e151e401cf8b66 on serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175)
        2011-05-08 17:46:45,538 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467
        2011-05-08 17:47:45,548 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467
        2011-05-08 17:48:45,545 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467
        2011-05-08 17:49:46,108 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467
        2011-05-08 17:50:46,105 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467
        2011-05-08 17:51:46,117 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467
        2011-05-08 17:52:46,112 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467
        2011-05-08 17:52:47,309 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x22fcd582836003d Retrieved 125 byte(s) of data from znode /hbase/unassigned/4418fb197685a21f77e151e401cf8b66 and set watcher; region=ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66., server=C4C4.site,60020,1304820199467, state=RS_ZK_REGION_CLOSED
        2011-05-08 17:52:47,388 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/4418fb197685a21f77e151e401cf8b66 (region=ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66., server=C4C4.site,60020,1304820199467, state=RS_ZK_REGION_CLOSED)
        2011-05-08 17:52:47,388 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=C4C4.site,60020,1304820199467, region=4418fb197685a21f77e151e401cf8b66
        2011-05-08 17:52:47,388 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 4418fb197685a21f77e151e401cf8b66 from server C4C4.site,60020,1304820199467 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states
        2011-05-08 17:52:47,397 WARN org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 4418fb197685a21f77e151e401cf8b66 on serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175)

        Show
        gaojinchao added a comment - Hi, Stack. I made a mistake above analysis. I read the code again, The root reason is spliting message repeated. 2011-05-08 17:42:45,514 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467 //Closed the region. 2011-05-08 17:43:37,599 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. (offlining) 2011-05-08 17:43:45,525 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467 // set RIT state and sent a message 2011-05-08 17:44:25,745 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175) for region ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. //Received split message again and RIT was deleted . So it could not process closed event. 2011-05-08 17:44:45,530 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467 2011-05-08 17:45:45,542 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467 // 2011-05-08 17:46:45,303 WARN org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 4418fb197685a21f77e151e401cf8b66 on serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175) 2011-05-08 17:46:45,538 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467 2011-05-08 17:47:45,548 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467 2011-05-08 17:48:45,545 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467 2011-05-08 17:49:46,108 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467 2011-05-08 17:50:46,105 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467 2011-05-08 17:51:46,117 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467 2011-05-08 17:52:46,112 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: ufdr,2011050812#8613817306227#0516,1304845660567.8e9a3b05abe1c3a692999cf5e8dfd9dd.: Daughters; ufdr,2011050812#8613817306227#0516,1304847764729.5e4bca85c33fa6605ffc9a5c2eb94e62., ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66. from C4C4.site,60020,1304820199467 2011-05-08 17:52:47,309 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x22fcd582836003d Retrieved 125 byte(s) of data from znode /hbase/unassigned/4418fb197685a21f77e151e401cf8b66 and set watcher; region=ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66., server=C4C4.site,60020,1304820199467, state=RS_ZK_REGION_CLOSED 2011-05-08 17:52:47,388 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/4418fb197685a21f77e151e401cf8b66 (region=ufdr,2011050812#8613817398167#4032,1304847764729.4418fb197685a21f77e151e401cf8b66., server=C4C4.site,60020,1304820199467, state=RS_ZK_REGION_CLOSED) 2011-05-08 17:52:47,388 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=C4C4.site,60020,1304820199467, region=4418fb197685a21f77e151e401cf8b66 2011-05-08 17:52:47,388 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 4418fb197685a21f77e151e401cf8b66 from server C4C4.site,60020,1304820199467 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states 2011-05-08 17:52:47,397 WARN org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 4418fb197685a21f77e151e401cf8b66 on serverName=C4C4.site,60020,1304820199467, load=(requests=0, regions=123, usedHeap=4097, maxHeap=8175)
        Hide
        gaojinchao added a comment -

        I see, Regionserver do a setData with adding version number. Hmaster refresh the data only. and stat.RS_ZK_REGION_CLOSED is last state that shouldn't be update.
        It seems no a race condition.
        Hmaster do a setdata of state RS_ZK_REGION_CLOSED is not good method.
        I try to find other method to deal with this case. But It seems diffcult to do.

        Show
        gaojinchao added a comment - I see, Regionserver do a setData with adding version number. Hmaster refresh the data only. and stat.RS_ZK_REGION_CLOSED is last state that shouldn't be update. It seems no a race condition. Hmaster do a setdata of state RS_ZK_REGION_CLOSED is not good method. I try to find other method to deal with this case. But It seems diffcult to do.
        Hide
        stack added a comment -

        Gao:

        Half of this patch is white space changes. Can you make it so you do not disturb existing code? (It looks like you left-shifted a bunch of code by one space).

        My original review would seem to apply to this version of the patch (Am I reviewing the same patch?). You are having the master do a setData. In particular, you are having the master do a setData of state RS_ZK_REGION_CLOSED. Only the regionserver does this usually. If the master does it, and somehow the regionserver does it too around the same time, then you have introduced a race condition.

        Show
        stack added a comment - Gao: Half of this patch is white space changes. Can you make it so you do not disturb existing code? (It looks like you left-shifted a bunch of code by one space). My original review would seem to apply to this version of the patch (Am I reviewing the same patch?). You are having the master do a setData. In particular, you are having the master do a setData of state RS_ZK_REGION_CLOSED. Only the regionserver does this usually. If the master does it, and somehow the regionserver does it too around the same time, then you have introduced a race condition.
        Hide
        gaojinchao added a comment -

        I finished test, please review it. Thanks

        Show
        gaojinchao added a comment - I finished test, please review it. Thanks
        Hide
        gaojinchao added a comment -

        I am not familiar with zk api and have learned it now. I make a patch again.
        I want to use api "setData(ZooKeeperWatcher zkw, String znode,
        byte [] data)". It seems dangerous for parallel operation.
        I want to verify more carefully in next week.

        Show
        gaojinchao added a comment - I am not familiar with zk api and have learned it now. I make a patch again. I want to use api "setData(ZooKeeperWatcher zkw, String znode, byte [] data)". It seems dangerous for parallel operation. I want to verify more carefully in next week.
        Hide
        stack added a comment -

        FYI, we write else like this:

        } else {
        

        rather than like this:

        }
        else
        {
        

        Not important though.

        Since the below

        String node = ZKAssign.getNodeName(watcher,
        +                             regionInfo.getEncodedName());
        

        is used by the if and the else, maybe it would make sense to get this before we do the if or else?

        The message here is wrong?

        +                          if (data.getEventType() != EventType.RS_ZK_REGION_CLOSED) {
        +                             LOG.debug("Region has transitioned to CLOEING, allowing " +
        +                                  "watched event handlers to process");
        +                             break;
        

        Can you say for sure the region is in CLOSING state? Maybe its in some other state altogether?

        The below is not a good idea. The master is setting a state that is only usually set by the regionserver:

        +                         data = new RegionTransitionData(
        +                                 EventType.RS_ZK_REGION_CLOSED, regionInfo.getRegionName(),
        +                                    master.getServerName());
        

        Do you have to make a new RTD here? Can't you use the one you just decoded a few lines earlier?

        And why are you calling a setData on this with an upped version number?

        I think I understand now what you are trying to fix. It looks like a genuine issue to me and a patch like this one is needed. Thanks for working on this Gao.

        Show
        stack added a comment - FYI, we write else like this: } else { rather than like this: } else { Not important though. Since the below String node = ZKAssign.getNodeName(watcher, + regionInfo.getEncodedName()); is used by the if and the else, maybe it would make sense to get this before we do the if or else? The message here is wrong? + if (data.getEventType() != EventType.RS_ZK_REGION_CLOSED) { + LOG.debug( "Region has transitioned to CLOEING, allowing " + + "watched event handlers to process" ); + break ; Can you say for sure the region is in CLOSING state? Maybe its in some other state altogether? The below is not a good idea. The master is setting a state that is only usually set by the regionserver: + data = new RegionTransitionData( + EventType.RS_ZK_REGION_CLOSED, regionInfo.getRegionName(), + master.getServerName()); Do you have to make a new RTD here? Can't you use the one you just decoded a few lines earlier? And why are you calling a setData on this with an upped version number? I think I understand now what you are trying to fix. It looks like a genuine issue to me and a patch like this one is needed. Thanks for working on this Gao.

          People

          • Assignee:
            Unassigned
            Reporter:
            gaojinchao
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development