
HBASE-5606: SplitLogManager async delete node hangs log splitting when ZK connection is lost

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.92.0
    • Fix Version/s: 0.92.2, 0.94.0, 0.95.0
    • Component/s: wal
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      1. A region server died; the ServerShutdownHandler detected it and started distributed log splitting.
      2. All tasks failed because the ZK connection was lost, so all the tasks were deleted asynchronously.
      3. The ServerShutdownHandler retried the log splitting.
      4. The asynchronous deletions from step 2 finally fired against the newly created tasks.
      5. This left the SplitLogManager hanging.

      This leaves the .META. region unassigned for a long time.

      hbase-root-master-HOST-192-168-47-204.log.2012-03-14"(55413,79):2012-03-14 19:28:47,932 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task at znode /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
      hbase-root-master-HOST-192-168-47-204.log.2012-03-14"(89303,79):2012-03-14 19:34:32,387 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task at znode /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
      
      hbase-root-master-HOST-192-168-47-204.log.2012-03-14"(80417,99):2012-03-14 19:34:31,196 DEBUG org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback: deleted /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
      hbase-root-master-HOST-192-168-47-204.log.2012-03-14"(89456,99):2012-03-14 19:34:32,497 DEBUG org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback: deleted /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
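      The interleaving in the logs above can be sketched with a toy model (illustrative Java only, not HBase code; the path and data values are made up): the delete queued for the failed attempt is keyed only by path, so when it finally fires it removes whatever node currently lives at that path, including the retried task.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical, simplified model of the race described above (not HBase code):
// a delete queued for the *old* task fires after the retry has re-created the
// znode at the same path, and removes the new task instead.
public class StaleDeleteRace {
    static Map<String, String> znodes = new HashMap<>();      // simulated ZK namespace
    static Queue<String> pendingDeletes = new ArrayDeque<>(); // async deletes queued while disconnected

    static void installTask(String path, String data) { znodes.put(path, data); }

    public static boolean newTaskSurvives() {
        String path = "/hbase/splitlog/wal-1";
        installTask(path, "attempt-1");        // step 1: task installed
        pendingDeletes.add(path);              // step 2: task failed, delete queued asynchronously
        installTask(path, "attempt-2");        // step 3: ServerShutdownHandler retries, same path
        while (!pendingDeletes.isEmpty()) {    // step 4: stale delete finally runs
            znodes.remove(pendingDeletes.poll()); // deletes by path only, so it kills attempt-2
        }
        return znodes.containsKey(path);       // step 5: new task is gone -> manager hangs
    }

    public static void main(String[] args) {
        System.out.println("new task survives: " + newTaskSurvives());
    }
}
```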
      

        Activity

        stack added a comment -

        Pretty bad. Look to fix for 0.92.2

        Chinna Rao Lalam added a comment -

        This situation can arise in 0.92 when:

        1) First, the SplitLogManager installed one task; after that it could not connect to ZooKeeper (because of CONNECTIONLOSS).

        So the GetDataAsyncCallback, which is registered at createNode() time in installTask() or by the TimeoutMonitor, failed and kept retrying.

        19:32:24,657 WARN org.apache.hadoop.hbase.master.SplitLogManager$GetDataAsyncCallback: getdata rc = CONNECTIONLOSS /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020.1331752316170 retry=0
        

        2) When the GetDataAsyncCallback reaches retry=0 it calls setDone(), which increments batch.error and registers a DeleteAsyncCallback.

        3) At this point installed != done, so the SplitLogManager throws an exception and the handler submits the split again.

        4) "failed to set data watch" happened 92 times, so 92 DeleteAsyncCallbacks were registered, and every one of them retries until it succeeds.

        19:34:30,874 WARN org.apache.hadoop.hbase.master.SplitLogManager: failed to set data watch /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
        

        5) Because of point 3, the SplitLogManager tried to install the task again, but found the already installed task in FAILURE state, so it waited for it to change to DELETED.

        6) Once the ZooKeeper connection was back, one of the DeleteAsyncCallbacks deleted the node and notified the task waiting at point 5.

        19:34:31,196 DEBUG org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback: deleted /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
        

        7) After being notified, the waiter at point 5 created the node again.

        19:34:32,387 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task at znode /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
        

        8) But the already registered DeleteAsyncCallbacks still executed and deleted the node newly created at point 7.

        19:34:32,497 DEBUG org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback: deleted /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
        

        9) Because the node was deleted and removed from the tasks map, the flow never reaches the code in setDone() that increments batch.done or batch.error.
        So waitTasks() will loop forever and never return.

        "MASTER_META_SERVER_OPERATIONS-HOST-192-168-47-204,60000,1331719909985-1" prio=10 tid=0x0000000040d7c000 nid=0x624b in Object.wait() [0x00007ff090482000]
           java.lang.Thread.State: TIMED_WAITING (on object monitor)
        	at java.lang.Object.wait(Native Method)
        	at org.apache.hadoop.hbase.master.SplitLogManager.waitTasks(SplitLogManager.java:316)
        	- locked <0x000000078e6c4258> (a org.apache.hadoop.hbase.master.SplitLogManager$TaskBatch)
        	at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:262)
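        A minimal sketch of why waitTasks() spins at this point, assuming the simplified invariant that only setDone() advances the batch counters (class and method names here are illustrative, not the real SplitLogManager):

```java
// Hypothetical sketch (not the real SplitLogManager): waitTasks() loops until
// batch.done + batch.error catches up with batch.installed, and setDone() is
// the only place those counters move. A task deleted out of the tasks map
// never reaches setDone(), so the batch can never complete.
public class WaitTasksHang {
    static class TaskBatch { int installed; int done; int error; }

    // Returns true if the batch would complete, false if waitTasks() would spin.
    public static boolean batchCompletes(boolean taskStillInMap) {
        TaskBatch batch = new TaskBatch();
        batch.installed = 1;
        if (taskStillInMap) {
            batch.error++; // setDone() runs and accounts for the task
        }
        // waitTasks() loops while (batch.done + batch.error) < batch.installed
        return batch.done + batch.error >= batch.installed;
    }

    public static void main(String[] args) {
        System.out.println(batchCompletes(true));  // normal path: batch completes
        System.out.println(batchCompletes(false)); // node deleted out from under us: hang
    }
}
```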
        
        Chinna Rao Lalam added a comment -

        I am trying to fix this; any suggestions are welcome.

        ramkrishna.s.vasudevan added a comment -

        It is similar to HBASE-5081, I feel.

        Prakash Khemani added a comment -

        @Chinna

        It is the TimeoutMonitor that causes so many Deletes to be queued.

        The fix will be the following:

        In TimeoutMonitor do not call getDataSetWatch() if the task has already failed.

        Ignore the call to getDataSetWatch() if there is already a pending getDataSetWatch against the task.

        Thanks for finding this issue.
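        The two guards proposed above could look roughly like this (a hypothetical sketch; Status, pendingGetData, and the method name are invented for illustration, not the actual patch):

```java
// Hypothetical sketch of the two TimeoutMonitor guards proposed above:
// skip getDataSetWatch() for a task that has already failed, and skip it
// when a previous getDataSetWatch() against the task is still outstanding.
public class TimeoutMonitorGuard {
    enum Status { IN_PROGRESS, FAILURE }

    static class Task {
        Status status = Status.IN_PROGRESS;
        boolean pendingGetData = false; // a getDataSetWatch is already outstanding
    }

    // Returns true only when the monitor should issue getDataSetWatch().
    public static boolean shouldGetDataSetWatch(Task task) {
        if (task.status == Status.FAILURE) return false; // task already failed: nothing to check
        if (task.pendingGetData) return false;           // one outstanding call is enough
        task.pendingGetData = true;                      // cleared when the callback completes
        return true;
    }

    public static void main(String[] args) {
        Task t = new Task();
        System.out.println(shouldGetDataSetWatch(t)); // first call goes out
        System.out.println(shouldGetDataSetWatch(t)); // suppressed: still pending
        t.pendingGetData = false;
        t.status = Status.FAILURE;
        System.out.println(shouldGetDataSetWatch(t)); // suppressed: task already failed
    }
}
```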

        Ted Yu added a comment -

        Initial patch according to Prakash's suggestion.

        Still need to find out the API for querying whether there is an outstanding watcher for a specific path.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12519319/5606.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 3 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster
        org.apache.hadoop.hbase.mapreduce.TestImportTsv
        org.apache.hadoop.hbase.mapred.TestTableMapReduce
        org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat
        org.apache.hadoop.hbase.master.TestSplitLogManager

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1244//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1244//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1244//console

        This message is automatically generated.

        Prakash Khemani added a comment -

        The getDataSetWatch() call in the timeout-monitor is only being done to check whether the znode still exists or not. If there is a failure in getting to the znode then we should ignore that failure.

        How about implementing the following

        in timeoutmonitor
        call getDataSetWatch() only if task has not already failed. (This is just an optimization and it can be done without any locking)

        for this particular getDataSetWatch() call, store an IGNORE-ZK-ERROR flag in the zk async context. If a zk error happens, silently do nothing.
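        A rough sketch of the proposed flag, assuming a per-call context object handed to the async callback (all names here are illustrative, not actual HBase or ZooKeeper API):

```java
// Hypothetical sketch of carrying an "ignore errors" flag in the async call
// context, as proposed above. When set, a ZK error on the existence check is
// swallowed instead of triggering the retry/setDone()/delete machinery.
public class IgnoreZkErrorContext {
    static class GetDataContext {
        final String path;
        final boolean ignoreZkError; // set only for the TimeoutMonitor's existence check
        GetDataContext(String path, boolean ignoreZkError) {
            this.path = path;
            this.ignoreZkError = ignoreZkError;
        }
    }

    // Returns true if the error was swallowed, false if normal error handling ran.
    public static boolean onGetDataError(GetDataContext ctx) {
        if (ctx.ignoreZkError) {
            return true; // do nothing: no retry, no setDone(), no queued delete
        }
        // ... normal path: retry, and eventually setDone(FAILURE) ...
        return false;
    }

    public static void main(String[] args) {
        System.out.println(onGetDataError(new GetDataContext("/hbase/splitlog/x", true)));
        System.out.println(onGetDataError(new GetDataContext("/hbase/splitlog/x", false)));
    }
}
```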

        Chinna Rao Lalam added a comment -

        @Prakash
        Thanks, Prakash, for the points.

        in timeoutmonitor call getDataSetWatch() only if task has not already failed. (This is just an optimization and it can be done without any locking)

        Here I think the getDataSetWatch() call in the TimeoutMonitor should be synchronized, because a race condition may arise between the setDone() call and the TimeoutMonitor's getDataSetWatch() call.

        for this particular getDataSetWatch() call, store a IGNORE-ZK-ERROR flag in the zk async context. If a zk error happens silently then do nothing.

        Can you elaborate a little more on this point? Normally, if any error comes, we retry. By introducing IGNORE-ZK-ERROR we need to skip the retry; who will set this flag, and when can it be true? Will it be true when the task is FAILURE? (If my understanding is not wrong.)

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12519319/5606.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.mapreduce.TestImportTsv
        org.apache.hadoop.hbase.mapred.TestTableMapReduce
        org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1255//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1255//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1255//console

        This message is automatically generated.

        Jimmy Xiang added a comment -

        This is similar issue as HBASE-5081, right?

        Will my original fix proposed for HBASE-5081 help: don't retry distributed log splitting before the tasks are actually deleted?
        We can abort the master after several retries to delete the tasks.

        Prakash Khemani added a comment -

        Do not do any error processing if the getDataSetWatch() call from SplitLogManager timeoutMonitor fails

        Prakash Khemani added a comment -

        @Jimmy This is similar to HBASE-5081 w.r.t what goes wrong - a pending delete creates havoc on the next create. But it is different from HBASE-5081 because the pending Delete is created at a different point in the code - in the timeoutMonitor and not when the task actually fails ...

        Jimmy Xiang added a comment -

        @Prakash, could there be other places where a failed delete can cause this issue?

        Would it be a cleaner fix to change the async delete to a sync delete? With a sync delete, we can
        avoid all these race problems, and the retry will get a fresh start each time.
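        The synchronous-delete alternative could be sketched as follows (hypothetical; deleteOnce stands in for a blocking ZooKeeper delete, and MAX_RETRIES is an invented bound):

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of the synchronous-delete alternative discussed above:
// retry the delete a bounded number of times and have the caller abort the
// master if it never succeeds, so no stale async delete can linger into the
// next split attempt.
public class SyncDeleteWithRetries {
    static final int MAX_RETRIES = 3;

    // Returns true if deleted, false if the caller should abort the master.
    public static boolean deleteSync(BooleanSupplier deleteOnce) {
        for (int i = 0; i < MAX_RETRIES; i++) {
            if (deleteOnce.getAsBoolean()) return true;
        }
        return false; // exhausted retries: caller aborts instead of racing
    }

    public static void main(String[] args) {
        int[] calls = {0};
        boolean ok = deleteSync(() -> ++calls[0] >= 2); // fails once, then succeeds
        System.out.println(ok + " after " + calls[0] + " attempts");
    }
}
```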

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12520003/0001-HBASE-5606-SplitLogManger-async-delete-node-hangs-lo.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.io.hfile.TestForceCacheImportantBlocks
        org.apache.hadoop.hbase.mapreduce.TestImportTsv
        org.apache.hadoop.hbase.mapred.TestTableMapReduce
        org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat
        org.apache.hadoop.hbase.master.TestSplitLogManager

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1310//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1310//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1310//console

        This message is automatically generated.

        Ted Yu added a comment -

        Re-attaching Prakash's patch.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12520026/0001-HBASE-5606-SplitLogManger-async-delete-node-hangs-lo.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.mapreduce.TestImportTsv
        org.apache.hadoop.hbase.mapred.TestTableMapReduce
        org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1314//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1314//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1314//console

        This message is automatically generated.

        stack added a comment -

        ping on this patch. What we thinking here? Patch seems clean enough. Jimmy, you think there might be other places where we need to do similar? Prakash, what you think of Jimmy's suggestion of sync'ing the delete (I think I know what you are going to say!).

        Chinna Rao Lalam added a comment -

        The patch looks clean to me as well.

        Jimmy Xiang added a comment -

        It is ok with me. Hopefully, there is no other place.

        Ted Yu added a comment -

        Will integrate later today if there is no objection.

        Prakash Khemani added a comment -

        Making the deletes synchronous doesn't theoretically remove the race condition. A master could send the delete to the zk-server it is connected to and die. The next master can (theoretically) still run into the pending delete race.

        Ted Yu added a comment -

        Integrated to 0.92, 0.94 and trunk.

        Thanks for the patch Prakash.

        Thanks for the review Stack, Jimmy and Chinna.

        Hudson added a comment -

        Integrated in HBase-TRUNK #2707 (See https://builds.apache.org/job/HBase-TRUNK/2707/)
        HBASE-5606 SplitLogManger async delete node hangs log splitting when ZK connection is lost
        (Prakash) (Revision 1309173)

        Result = FAILURE
        tedyu :
        Files :

        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
        Hudson added a comment -

        Integrated in HBase-0.94 #84 (See https://builds.apache.org/job/HBase-0.94/84/)
        HBASE-5606 SplitLogManger async delete node hangs log splitting when ZK connection is lost
        (Prakash) (Revision 1309172)

        Result = FAILURE
        tedyu :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
        Hudson added a comment -

        Integrated in HBase-0.92 #352 (See https://builds.apache.org/job/HBase-0.92/352/)
        HBASE-5606 SplitLogManger async delete node hangs log splitting when ZK connection is lost
        (Prakash) (Revision 1309171)

        Result = SUCCESS
        tedyu :
        Files :

        • /hbase/branches/0.92/CHANGES.txt
        • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
        Hudson added a comment -

        Integrated in HBase-0.94-security #7 (See https://builds.apache.org/job/HBase-0.94-security/7/)
        HBASE-5606 SplitLogManger async delete node hangs log splitting when ZK connection is lost
        (Prakash) (Revision 1309172)

        Result = SUCCESS
        tedyu :
        Files :

        • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
        Hudson added a comment -

        Integrated in HBase-0.92-security #104 (See https://builds.apache.org/job/HBase-0.92-security/104/)
        HBASE-5606 SplitLogManger async delete node hangs log splitting when ZK connection is lost
        (Prakash) (Revision 1309171)

        Result = FAILURE
        tedyu :
        Files :

        • /hbase/branches/0.92/CHANGES.txt
        • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java
        Hudson added a comment -

        Integrated in HBase-TRUNK-security #157 (See https://builds.apache.org/job/HBase-TRUNK-security/157/)
        HBASE-5606 SplitLogManger async delete node hangs log splitting when ZK connection is lost
        (Prakash) (Revision 1309173)

        Result = FAILURE
        tedyu :
        Files :

        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java

          People

          • Assignee:
            Prakash Khemani
          • Reporter:
            Gopinathan A
          • Votes: 0
          • Watchers: 6