HBASE-10575

ReplicationSource thread can't be terminated if it runs into the loop to contact peer's zk ensemble and fails continuously

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.98.1, 0.99.0, 0.94.17
    • Fix Version/s: 0.96.2, 0.98.1, 0.99.0, 0.94.18
    • Component/s: Replication
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      When a ReplicationSource thread enters the loop that contacts the peer's ZooKeeper (zk) ensemble, it does not check isActive() before each retry. If the peer's zk ensemble is unreachable for any reason, the ReplicationSource thread cannot be terminated from outside, for example by removePeer.
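
      The failure mode described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual ReplicationSource code: the class name `PeerLookupSketch`, the `Supplier`-based peer lookup, and the `active` flag standing in for isActive() are all hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical stand-in for the retry loop described above; not HBase code.
public class PeerLookupSketch {
  // Plays the role of isActive(); an outside caller such as removePeer clears it.
  private final AtomicBoolean active = new AtomicBoolean(true);
  private final AtomicInteger attempts = new AtomicInteger();

  public void terminate() { active.set(false); }
  public int attempts() { return attempts.get(); }

  // Retries until the lookup yields a non-null id or termination is requested.
  // Without the active.get() check, an unreachable peer keeps this loop
  // retrying forever, which is the behavior the description reports.
  public String lookupPeerId(Supplier<String> peerLookup, long sleepMillis) {
    String peerId = null;
    while (active.get() && peerId == null) {
      attempts.incrementAndGet();
      peerId = peerLookup.get(); // null models "peer's zk ensemble unreachable"
      if (peerId == null) {
        try {
          Thread.sleep(sleepMillis);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return null;
        }
      }
    }
    return peerId;
  }
}
```

      In this sketch, calling `terminate()` from another thread ends the loop at the next retry boundary even when every lookup fails; a loop that tests only `peerId == null` would have no such exit.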

      Attachments

      1. 10575.txt (1 kB, Lars Hofhansl)
      2. HBASE-10575-trunk_v1.patch (2 kB, Honghua Feng)

        Activity

        Lars Hofhansl made changes -
        Status Resolved [ 5 ] → Closed [ 6 ]
        Hudson added a comment -

        FAILURE: Integrated in HBase-0.94-JDK7 #67 (See https://builds.apache.org/job/HBase-0.94-JDK7/67/)
        HBASE-10575 ReplicationSource thread can't be terminated if it runs into the loop to contact peer's zk ensemble and fails continuously (Feng Honghua & LarsH) (larsh: rev 1572441)

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
        Hudson added a comment -

        FAILURE: Integrated in HBase-0.94-on-Hadoop-2 #35 (See https://builds.apache.org/job/HBase-0.94-on-Hadoop-2/35/)
        HBASE-10575 ReplicationSource thread can't be terminated if it runs into the loop to contact peer's zk ensemble and fails continuously (Feng Honghua & LarsH) (larsh: rev 1572441)

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
        Hudson added a comment -

        SUCCESS: Integrated in HBase-0.94 #1303 (See https://builds.apache.org/job/HBase-0.94/1303/)
        HBASE-10575 ReplicationSource thread can't be terminated if it runs into the loop to contact peer's zk ensemble and fails continuously (Feng Honghua & LarsH) (larsh: rev 1572441)

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
        Hudson added a comment -

        SUCCESS: Integrated in HBase-0.94-security #425 (See https://builds.apache.org/job/HBase-0.94-security/425/)
        HBASE-10575 ReplicationSource thread can't be terminated if it runs into the loop to contact peer's zk ensemble and fails continuously (Feng Honghua & LarsH) (larsh: rev 1572441)

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
        Lars Hofhansl made changes -
        Fix Version/s 0.94.18 [ 12325952 ]
        Lars Hofhansl added a comment -

        Committed to 0.94 as well.

        Lars Hofhansl made changes -
        Attachment 10575.txt [ 12631459 ]
        Lars Hofhansl added a comment -

        What I am planning to commit to 0.94.

        Hudson added a comment -

        FAILURE: Integrated in HBase-TRUNK-on-Hadoop-1.1 #99 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-1.1/99/)
        HBASE-10575 ReplicationSource thread can't be terminated if it runs into the loop to contact peer's zk ensemble and fails continuously (stack: rev 1571579)

        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
        Hudson added a comment -

        SUCCESS: Integrated in HBase-0.98 #184 (See https://builds.apache.org/job/HBase-0.98/184/)
        HBASE-10575 ReplicationSource thread can't be terminated if it runs into the loop to contact peer's zk ensemble and fails continuously (stack: rev 1571580)

        • /hbase/branches/0.98/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
        Hudson added a comment -

        FAILURE: Integrated in hbase-0.96-hadoop2 #215 (See https://builds.apache.org/job/hbase-0.96-hadoop2/215/)
        HBASE-10575 ReplicationSource thread can't be terminated if it runs into the loop to contact peer's zk ensemble and fails continuously (stack: rev 1571582)

        • /hbase/branches/0.96/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
        Hudson added a comment -

        FAILURE: Integrated in HBase-0.98-on-Hadoop-1.1 #171 (See https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/171/)
        HBASE-10575 ReplicationSource thread can't be terminated if it runs into the loop to contact peer's zk ensemble and fails continuously (stack: rev 1571580)

        • /hbase/branches/0.98/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
        Hudson added a comment -

        FAILURE: Integrated in HBase-TRUNK #4952 (See https://builds.apache.org/job/HBase-TRUNK/4952/)
        HBASE-10575 ReplicationSource thread can't be terminated if it runs into the loop to contact peer's zk ensemble and fails continuously (stack: rev 1571579)

        • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
        Hudson added a comment -

        FAILURE: Integrated in hbase-0.96 #313 (See https://builds.apache.org/job/hbase-0.96/313/)
        HBASE-10575 ReplicationSource thread can't be terminated if it runs into the loop to contact peer's zk ensemble and fails continuously (stack: rev 1571582)

        • /hbase/branches/0.96/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
        stack added a comment -

        Oh, thanks for the patch Honghua Feng

        stack made changes -
        Status Patch Available [ 10002 ] → Resolved [ 5 ]
        Hadoop Flags Reviewed [ 10343 ]
        Fix Version/s 0.94.18 [ 12325952 ]
        Resolution Fixed [ 1 ]
        stack added a comment -

        I committed to 0.96-0.99. Doesn't apply to 0.94 so skipping on it for now. I left the method named uninitialize.

        Honghua Feng added a comment -

        Lars Hofhansl, thanks for the review!

        Can it be committed, or any further feedback? Thanks

        Honghua Feng added a comment -

        Simply 'close', thanks. By 'cleanup' in the comment above I meant 'close + logging'; perhaps a misuse of the word 'cleanup'?

        Lars Hofhansl added a comment (edited) -

        "cleanup" or simply "close"?

        Honghua Feng added a comment -

        Quoting Lars Hofhansl: I would probably rename "uninitialize" to "terminate", otherwise looks good to me.

        You mean the refactored 'uninitialize' method? Hmm... IMHO 'uninitialize' is more accurate than 'terminate': it only does cleanup (closing the connection and logging) before the containing thread terminates, and it does not itself terminate the replication thread. There is also already a terminate method, which ReplicationManager uses to terminate a replication thread from outside.

        Lars Hofhansl added a comment -

        Looks good. I would probably rename "uninitialize" to "terminate", otherwise looks good to me.
        Straight bug fix, so adding 0.94 and 0.96 as well.

        Lars Hofhansl made changes -
        Fix Version/s 0.96.2 [ 12325658 ]
        Fix Version/s 0.94.18 [ 12325952 ]
        Honghua Feng added a comment -

        Ping for another +1 for this jira to be committed? thanks!

        Andrew Purtell made changes -
        Fix Version/s 0.98.1 [ 12325664 ]
        Andrew Purtell added a comment -

        That test has failed in other precommit builds also, seems unrelated.

        +1 for 0.98 branch also

        Honghua Feng added a comment -

        Unit tests pass in my local run, and the failed cases look like they have nothing to do with the patch...weird

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12630020/HBASE-10575-trunk_v1.patch
        against trunk revision .
        ATTACHMENT ID: 12630020

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 hadoop1.0. The patch compiles against the hadoop 1.0 profile.

        +1 hadoop1.1. The patch compiles against the hadoop 1.1 profile.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 lineLengths. The patch does not introduce lines longer than 100

        +1 site. The mvn site goal succeeds with this patch.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.snapshot.TestFlushSnapshotFromClient

        -1 core zombie tests. There are 1 zombie test(s): at org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.testLogRollOnDatanodeDeath(TestLogRolling.java:354)

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/8755//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8755//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8755//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-thrift.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8755//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8755//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8755//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8755//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8755//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8755//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8755//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/8755//console

        This message is automatically generated.

        Honghua Feng made changes -
        Status Open [ 1 ] → Patch Available [ 10002 ]
        Honghua Feng added a comment -

        Looks like all branches have this same bug. I have checked 0.94, 0.98 and 0.99...

        Honghua Feng made changes -
        Attachment HBASE-10575-trunk_v1.patch [ 12630020 ]
        Honghua Feng added a comment -

        Patch attached for the fix.

        It also makes two minor changes:

        1. Exit immediately, without sleeping, if isActive() returns false after a failed try.
        2. Close this.conn and print the ReplicationSource exiting log message on premature thread exit as well.
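
        The two minor changes listed above might look roughly like the sketch below. It is not the attached patch: the class `SourceExitSketch`, the `Closeable` connection, the in-memory log list, and the literal log text are hypothetical placeholders. The cleanup method is called `uninitialize` after the method name discussed in this issue, but its body here is a guess.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

// Hypothetical illustration of the two minor changes; not the committed patch.
public class SourceExitSketch {
  private final AtomicBoolean active = new AtomicBoolean(true); // stands in for isActive()
  private final Closeable conn;                                 // stands in for this.conn
  private final List<String> log = new ArrayList<>();
  private int sleeps = 0;

  public SourceExitSketch(Closeable conn) { this.conn = conn; }
  public void terminate() { active.set(false); }
  public int sleeps() { return sleeps; }
  public List<String> log() { return log; }

  public void run(Supplier<String> peerLookup, long sleepMillis) {
    String peerId = null;
    while (active.get() && peerId == null) {
      peerId = peerLookup.get();
      // Change 1: after a failed try, sleep only if the source is still active.
      if (peerId == null && active.get()) {
        sleeps++;
        try {
          Thread.sleep(sleepMillis);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          break;
        }
      }
    }
    if (peerId == null) {
      // Change 2: a premature exit runs the same cleanup as a normal exit.
      uninitialize();
      return;
    }
    // Normal replication work would go here in a real source.
    uninitialize();
  }

  // Closes the connection and records the exit message (placeholder text).
  private void uninitialize() {
    try {
      conn.close();
    } catch (IOException e) {
      log.add("close failed: " + e.getMessage());
    }
    log.add("ReplicationSource exiting");
  }
}
```

        With this shape, a terminate request that arrives during a failed lookup is honored without waiting out the retry sleep, and the connection is closed on the early-exit path too.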
        Honghua Feng made changes -
        Summary
          Original: ReplicationSource thread can't be terminated if it runs into the loop and fails to contact peer's zk ensemble continuously
          New: ReplicationSource thread can't be terminated if it runs into the loop to contact peer's zk ensemble and fails continuously
        Honghua Feng made changes -
        Priority Major [ 3 ] → Critical [ 2 ]
        Honghua Feng created issue -

          People

          • Assignee: Honghua Feng
          • Reporter: Honghua Feng
          • Votes: 0
          • Watchers: 7
