HBASE-8389

HBASE-8354 forces Namenode into loop with lease recovery requests

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.94.8
    • Component/s: None
    • Labels: None
    • Release Note:
      A new config param, hbase.lease.recovery.waiting.period, is introduced for FSHDFSUtils#recoverFileLease(). It represents the amount of time, in milliseconds, that we wait after issuing a recoverLease() request. The default value is 4 seconds; a brief example of setting it follows below.
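
      For reference, a minimal sketch of setting this parameter programmatically (the value shown is just the documented default, the class name is made up for illustration, and in practice the setting normally lives in hbase-site.xml):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;

        public class LeaseRecoveryConfigExample {
          public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();
            // 4000 ms is the documented default; raise it if the underlying HDFS timeouts are longer.
            conf.setLong("hbase.lease.recovery.waiting.period", 4000L);
            System.out.println(conf.getLong("hbase.lease.recovery.waiting.period", -1L));
          }
        }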

      Description

      We ran HBase 0.94.3 patched with HBASE-8354 and observed too many outstanding lease recoveries because of the short retry interval of 1 second between lease recoveries.

      The namenode gets into the following loop:
      1) Receives a lease recovery request every second (due to the aggressive retries) and initiates recovery, choosing a primary datanode each time
      2) A lease recovery succeeds and the namenode tries to commit the block under recovery as finalized - this takes < 10 seconds in our environment since we run with tight HDFS socket timeouts.
      3) At step 2), a more recent recovery is already enqueued because of the aggressive retries. This causes the committed block to get preempted, and we enter a vicious cycle.

      So we do: <initiate_recovery> --> <commit_block> --> <commit_preempted_by_another_recovery>

      This loop is paused after 300 seconds, which is the "hbase.lease.recovery.timeout". Hence the MTTR we are observing is 5 minutes, which is terrible. Our ZK session timeout is 30 seconds and the HDFS stale node detection timeout is 20 seconds.

      Note that before the patch, we do not call recoverLease so aggressively - also, it seems that the HDFS namenode is pretty dumb in that it keeps initiating new recoveries for every call. Before the patch, we call recoverLease, assume that the block was recovered, try to get the file, find it has zero length since it's under recovery, fail the task, and retry until we get a non-zero length. So things just work.

      Fixes:
      1) Expecting recovery to occur within 1 second is too aggressive. We need a more generous timeout, and it needs to be configurable since, typically, the recovery takes as much time as the DFS timeouts. The primary datanode doing the recovery tries to reconcile the blocks and hits the timeouts when it tries to contact the dead node. So the recovery is only as fast as the HDFS timeouts.

      2) We have another issue, reported in HDFS-4721. The namenode chooses the stale datanode to perform the recovery (since it is still considered alive). Hence the first recovery request is bound to fail. So if we want a tight MTTR, we either need something like HDFS-4721 or we need something like this:

      recoverLease(...)
      sleep(1000)
      recoverLease(...)
      sleep(configuredTimeout)
      recoverLease(...)
      sleep(configuredTimeout)

      Here, configuredTimeout should be large enough to let the recovery happen, while the first timeout is short so that we get past the moot recovery in step #1.
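
      As a rough illustration of the staged retry sketched above (a sketch only - the class, method and parameter names here are made up for illustration, not the actual patch):

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.hdfs.DistributedFileSystem;

        public final class StagedLeaseRecovery {
          private StagedLeaseRecovery() {}

          /**
           * Wait only briefly after the first attempt (which often lands on the dead/stale
           * primary DN and is moot), then wait a period sized to the underlying HDFS
           * timeouts between subsequent attempts.
           */
          public static void recoverWithBackoff(DistributedFileSystem dfs, Path p,
              long firstWaitMs, long configuredTimeoutMs, int maxAttempts) throws Exception {
            for (int attempt = 0; attempt < maxAttempts; attempt++) {
              if (dfs.recoverLease(p)) {
                return; // lease recovered and file closed
              }
              Thread.sleep(attempt == 0 ? firstWaitMs : configuredTimeoutMs);
            }
          }
        }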

      1. nn.log
        121 kB
        Varun Sharma
      2. nn1.log
        84 kB
        Varun Sharma
      3. sample.patch
        1 kB
        Varun Sharma
      4. 8389-trunk-v1.txt
        1 kB
        Ted Yu
      5. 8389-trunk-v2.patch
        2 kB
        Varun Sharma
      6. 8389-0.94.txt
        1 kB
        Ted Yu
      7. 8389-0.94-v2.txt
        1 kB
        Ted Yu
      8. 8389-0.94-v3.txt
        1 kB
        Ted Yu
      9. 8389-0.94-v4.txt
        1 kB
        Ted Yu
      10. 8389-trunk-v2.txt
        4 kB
        Ted Yu
      11. 8389-0.94-v5.txt
        1 kB
        Ted Yu
      12. 8389-0.94-v6.txt
        4 kB
        Ted Yu
      13. 8389-trunk-v3.txt
        5 kB
        Ted Yu


          Activity

          Varun Sharma added a comment -

          Attached the Namenode logs showing a large number of recoveries in progress...

          1) nn.log - showing a huge number of initiated recoveries
          2) nn1.log - showing a huge number of block finalization/commit failures

          We ran HBase with an increased sleep period between recoveries (25 seconds), and the time to recovery came down substantially.

          Varun Sharma added a comment -

          Attached a sample patch which works in our setup. We run with dfs.socket.timeout=3000 and dfs.socket.write.timeout=5000, so recovery typically takes < 20 seconds since there is 1 WAL and we recover only 1 block.

          If you closely observe the logs, the first useful recovery starts at 49:05 (since the first recovery chooses the dead datanode as the primary DN to do the recovery) and the first commit block is around 49:08. Hence the recovery finishes within 3 seconds - the same as dfs.socket.timeout, which is 3 seconds (the primary DN times out on the dead DN while trying to reconcile the replicas).

          I believe that if we do not pick stale replicas (> 20 second heartbeat) as primary DN(s), and, when we choose a non-stale replica as the primary DN, we do not reconcile blocks against the stale replica, we can get the lease recovery to finish under 1 second. Currently that is not the case.

          Varun

          Ted Yu added a comment -

          Patch for trunk where I introduced 'hbase.lease.recovery.retry.interval' with a default value of 10 seconds.

          If DistributedFileSystem provides an API that returns the recovery status for a given Path, we can utilize that to reduce the number of times we request lease recovery.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12579741/8389-trunk-v1.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 lineLengths. The patch does not introduce lines longer than 100

          +1 site. The mvn site goal succeeds with this patch.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/5374//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5374//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5374//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5374//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5374//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5374//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5374//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5374//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5374//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5374//console

          This message is automatically generated.

          Varun Sharma added a comment -

          Ted, thanks for the patch - I just attached a v2 with the corrected comments.

          Basically, there are two things here:
          a) The DDoS is independent of whether the namenode chose the stale datanode as the primary DN to do the recovery. All it needs is a recovery time slower than the retry interval, because then recoveries pile up faster than they actually complete. As a result, any recovery that succeeds gets preempted by a recovery that started later. So the retry interval needs to be as big as the underlying HDFS timeout.
          b) The very first recoverLease call is always a no-op. In fact every third call is a no-op, since the NN chooses DN1->DN2->DN3->DN1 in a cyclic order to do the recoveries. Note that DN1 is the dead datanode here.

          Currently, I think it will take 900 seconds for the cluster to recover if it accepts write traffic across all region servers.

          Varun

          Lars Hofhansl added a comment -

          Making this critical. Either we fix this here or revert HBASE-8354.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12579752/8389-trunk-v2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 lineLengths. The patch does not introduce lines longer than 100

          +1 site. The mvn site goal succeeds with this patch.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.TestFullLogReconstruction
          org.apache.hadoop.hbase.replication.TestReplicationQueueFailover
          org.apache.hadoop.hbase.security.access.TestAccessController

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/5375//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5375//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5375//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5375//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5375//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5375//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5375//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5375//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5375//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5375//console

          This message is automatically generated.

          Varun Sharma added a comment -

          I believe the test failed because I increased the default retry interval in patch v2.

          I think before HBASE-8354 we essentially have the following situation:

          1) Split task picked up by region server - recoverLease called
          2) Ignore return value
          3) Try to read the file, get a file length of 0, and sometimes try to grab the 0-length file from a DN
          4) Mostly fail because of the 0-length file or because the DN has not "finalized" the block and it is under recovery
          5) Task unassigned and bounces back and forth between multiple region servers (I sometimes see multiple region servers holding the same task)

          This process is equally bad - multi-minute recovery (not sure exactly how long).

          Ted Yu added a comment -

          I am reading https://issues.apache.org/jira/secure/attachment/12445209/appendDesign3.pdf as well as related code in FSNamesystem so that I gain a better understanding.
          I am also in contact with hdfs developers about this matter.

          If no satisfactory solution is found Monday, I will revert HBASE-8354.

          Varun Sharma added a comment -

          Hi Ted,

          Seems like lease recovery is the real thorn when it comes to recovery for HBase. The stale node detection patches work very well for splitting the finalized WAL but not the WAL currently being written to. I basically see a very long time to recovery because it always takes a long time for HDFS with the stock timeouts.

          I am trying out a patch for HDFS-4721 which basically avoids, during block recovery, all datanodes which have not heartbeated for 20 seconds. That seems to enable me to recover the block within 1 second. With that fix, we can survive the loss of a single datanode and recover the last WAL within 2-3 seconds.

          I have not heard from the HDFS community on it yet, but I think once we lose a datanode, it should not be chosen as the primary datanode for lease recovery nor for block reconciliation...

          Ted Yu added a comment -

          Thanks for the continued effort, Varun.

          Please post your findings on HDFS-4721 as well.

          Varun Sharma added a comment -

          Cool - I attached a rough patch with some logs to show how it allows lease recovery within 1-2 seconds. Now waiting for input from the HDFS community.

          Do we know what the status was w.r.t. lease recovery before HBASE-8354? We call recoverLease and access the file, and mostly error out - then do we retry, or do we unassign the ZK task for the split?

          Varun

          Liang Xie added a comment -

          Without HBASE-8354, it may lead to data loss; please refer to HBASE-7878 for detailed info, Varun Sharma.

          Ted Yu added a comment -

          Patch for 0.94
          Lease recovery retry is turned off by default.

          Varun Sharma added a comment -

          Hi Ted,

          +1 on patch for 0.94

          This basically retains the old behaviour prior to HBASE-7878. People who run clusters with tight HDFS timeouts, and know that lease/block recovery for them will occur within x seconds, can make use of the configurable retry interval to wait for the real recovery to happen.

          I feel that HBASE-7878 does not cause data loss for HBase, but I am not sure. My theory is that for HBase, the size of one WAL is < the size of one HDFS block before it is rolled. Hence each rolled WAL has one block, and the one on which the lease is being held contains one block that is under_recovery/under_construction - that file has size 0 in the namenode until the lease recovery is complete, because there are no finalized blocks for the WAL. So when we try to replay the WAL without lease recovery being complete, we get a file of size 0 from the namenode.

          From the region server logs, it seems that we do not take size 0 for the file as the truth but instead treat it as a failure until we get a WAL file size > 0.

          However, if the size of the WAL is > the HDFS block size, then it is possible that some HDFS blocks have been finalized and we get a file size > 0 because of the finalized blocks. In that case we could end up throwing away the last block belonging to the WAL. Maybe that is why this was observed for Accumulo but not for HBase.

          Nicolas Liochon added a comment -

          The error scenario is a recovery occurring while the region server is still up and writing: the recovery could be finished (WAL file read) while the not-so-dead regionserver continues writing, so the last writes will be lost.

          From the region server logs, it seems that we do not take size 0 for the file as the truth but instead treat it as a failure until we get a WAL file size > 0.

          What logs are you referring to?

          Nicolas Liochon added a comment -

          In 0.94, before HBASE-7878, we were waiting 1s after the releaseFileLease. Maybe it was a bug, but it was buying some time for the regionserver to abort. Maybe we could just increase this wait time to something like 5s. We need a proper fix for trunk, but I fear we will need something from HDFS for this. Increasing the sleep would limit the risk. We can make this configurable of course.

          boolean recovered = hdfs.recoverLease(path);
          if (!recovered) {        // HDFS has started an asynchronous lease recovery
            Thread.sleep(5000);    // hope that it buys enough time for the RS to stop writing
          }

          wdyt?

          Ted Yu added a comment -

          I plan to rename the new config parameter to hbase.lease.recovery.interval.
          This is in anticipation of a new HDFS API which allows us to query lease recovery progress.
          For 0.94, the retry would be off. We wait hbase.lease.recovery.interval before returning.

          Nicolas Liochon added a comment -

          Ted Yu: OK for me. What's the HDFS JIRA?

          ramkrishna.s.vasudevan added a comment -

          We need a proper fix for trunk, but I fear we will need something from hdfs for this.

          I think the original parent JIRA was to make HDFS provide something so that HBase knows the status of the recovery. I think that would be a cleaner way to solve this.

          Nicolas Liochon added a comment -

          I think that would be a more cleaner way to solve this.

          Yes, I agree. There is a JIRA to make it faster as well (HDFS-4721). I wonder if it's possible to make it synchronous; I expect the answer to be no, but the tradeoff is not that simple. As well, the lease removal could be started by the master, before starting the recovery. If there are a lot of files to read, it would be faster.

          Ted Yu added a comment -

          Patch v2 renames the config param to "hbase.lease.recovery.interval" with a default of 2.5 seconds.
          Test suite run for 8389-0.94.txt was green:

          Tests run: 1341, Failures: 0, Errors: 0, Skipped: 13
          
          [INFO] ------------------------------------------------------------------------
          [INFO] BUILD SUCCESS
          
          Nicolas Liochon added a comment -

          Seems it's OK. I think 5s is much safer (empirically, it seems the NN needs ~4s when it can speak to the datanode); this would cover the case of a dead regionserver but a live datanode.

          It's worth a comment and a log line imho.
          "LOG.info("Finished lease recover attempt for " + p);" could be changed.

          Ted Yu added a comment -

          Patch v3 increases the default wait time to 4 seconds.

          From https://issues.apache.org/jira/browse/HDFS-4721?focusedCommentId=13637698&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13637698:

          Earlier - recovery takes as long as dfs.socket.timeout but now it takes roughly 1-2 seconds (which is basically the heartbeat interval).

          Nicolas Liochon added a comment -

          Ok for 4 seconds.
          You don't want to add a comment for the "break"?

          Nicolas Liochon added a comment -

          +1, thanks Ted.

          Ted Yu added a comment - edited

          In HDFS-4525, the following API was added:

            /**
             * Get the close status of a file
             * @param src The path to the file
             *
             * @return return true if file is closed
             * @throws FileNotFoundException if the file does not exist.
             * @throws IOException If an I/O error occurred     
             */
            public boolean isFileClosed(Path src) throws IOException {
          

          Since it is not available in hadoop 1.1, I will utilize it, through reflection, for 0.95 / trunk.

          A new HBASE JIRA would be logged for the above improvement.
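
          As a rough sketch of how isFileClosed() could be looked up and invoked via reflection (illustrative only - this is not the committed patch, and the helper class name is made up):

            import java.lang.reflect.Method;
            import org.apache.hadoop.fs.Path;
            import org.apache.hadoop.hdfs.DistributedFileSystem;

            public final class IsFileClosedProbe {
              private IsFileClosedProbe() {}

              /**
               * Returns true if the running HDFS client exposes isFileClosed(Path) and
               * reports the file as closed; returns false when the API is absent
               * (hadoop 1.0/1.1) or the file is still open.
               */
              public static boolean isClosedIfSupported(DistributedFileSystem dfs, Path p) {
                try {
                  Method m = DistributedFileSystem.class.getMethod("isFileClosed", Path.class);
                  return (Boolean) m.invoke(dfs, p);
                } catch (NoSuchMethodException e) {
                  return false; // API not available; caller falls back to the fixed waiting period
                } catch (Exception e) {
                  return false; // reflection or I/O failure; treat as "not yet closed"
                }
              }
            }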

          Ted Yu added a comment -

          Lars Hofhansl:
          What do you think of 8389-0.94-v4.txt?

          Ted Yu added a comment -

          Patch for trunk utilizes the isFileClosed() API to check whether the file is closed.

          I moved the call to recoverLease() out of the loop because we don't need to call it a second time.

          Reflection is used for isFileClosed().

          TestHLogSplit and TestHLog passed.

          Enis Soztutar added a comment -

          With this patch, Varun's observation above about WALs having 2 or more blocks would still cause data loss. Can we do exponential backoff for the recoverLease() retries?
          Other than HDFS-4721 and HDFS-4724, should we log another HDFS JIRA for the recoverLease() call not retrying on outstanding recovery attempts?

          Ted Yu added a comment -

          Can we do exponential backoff at the recoverLease() retries?

          There is a balance we try to achieve here: reasonably fast recovery without data loss. Since isFileClosed() is absent in HDFS 1.0/1.1, we don't have a reliable way of knowing when to return from FSHDFSUtils#recoverFileLease().
          So for 0.94, I prefer to restore the previous behavior.

          For trunk, we can continue discussion to achieve the best solution (in another JIRA, if needed).

          Nicolas Liochon added a comment -

          For trunk:

          • This should not be in compat?
          • Catching raw Exception is dangerous imho.
          • If I understand well, in trunk with HDFS 1.x we do as in 0.94: we sleep to reduce the probability of data loss (I don't have a better solution). Then we should log it explicitly, and wait for a few seconds (4).
          Lars Hofhansl added a comment -
              this.blocksize = conf.getLong("hbase.regionserver.hlog.blocksize",
                  getDefaultBlockSize());
              // Roll at 95% of block size.
              float multi = conf.getFloat("hbase.regionserver.logroll.multiplier", 0.95f);
              this.logrollsize = (long)(this.blocksize * multi);
          

          It seems we're trying to keep the WAL file size smaller than the HDFS block size by default.

          Ted Yu: Patch looks good. I'd be happier if we increased the interval to 5s, though. If I understand the discussion correctly, we'll pile up recovery requests if that interval is less than the actual time it takes to do the recovery at the NN. Losing 1s in return for safety seems acceptable (I'm also fine with 4s, if you feel that would be better).

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12579863/8389-trunk-v2.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 lineLengths. The patch does not introduce lines longer than 100

          +1 site. The mvn site goal succeeds with this patch.

          +1 core tests. The patch passed unit tests in .

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/5383//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5383//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5383//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5383//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5383//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5383//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5383//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5383//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5383//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5383//console

          This message is automatically generated.

          Ted Yu added a comment -

          I updated the Release Notes.

          I prefer to keep the wait time at 4 seconds, since the HLog would be rolled at 95% of the block size.

          Ted Yu added a comment - edited

          Revised the Release Notes to be consistent with patch v5.

          hbase.lease.recovery.interval would be used in the trunk patch. It represents the interval between successive calls to isFileClosed(). I changed the config param in the 0.94 patch to hbase.lease.recovery.waiting.period.

          The trunk patch would go through a longer review/testing cycle - I will create a new JIRA once the patch for 0.94 is integrated.

          Ted Yu added a comment -

          Patch v6 for 0.94 detects whether HDFS supports isFileClosed() and uses it in the loop if the method is available.

          Ted Yu added a comment -

          Trunk patch v3 adjusts the waiting period according to whether isFileClosed is supported or not.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12579899/8389-trunk-v3.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 lineLengths. The patch does not introduce lines longer than 100

          +1 site. The mvn site goal succeeds with this patch.

          +1 core tests. The patch passed unit tests in .

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/5384//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5384//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5384//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5384//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5384//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5384//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5384//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5384//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5384//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5384//console

          This message is automatically generated.

          Lars Hofhansl added a comment -

          Wait, so in 0.94 we're only calling isFileClosed() in a loop but not recoverLease(...)?

          Ted Yu added a comment -

          recoverLease() should not be called repeatedly; HDFS takes care of the lease recovery.
          8389-0.94-v6.txt is the same as 8389-0.94-v5.txt when isFileClosed is not supported.
          Otherwise we poll DistributedFileSystem for the completion of lease recovery.
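
          A minimal sketch of that polling approach (illustrative only - the names, poll interval and timeout are made up, not the actual v6 patch):

            import org.apache.hadoop.fs.Path;
            import org.apache.hadoop.hdfs.DistributedFileSystem;

            public final class LeaseRecoveryWait {
              private LeaseRecoveryWait() {}

              /** Issue recoverLease once, then poll isFileClosed() until the file closes or we give up. */
              public static boolean recoverAndWait(DistributedFileSystem dfs, Path p,
                  long pollIntervalMs, long timeoutMs) throws Exception {
                if (dfs.recoverLease(p)) {
                  return true; // lease already recovered and file closed
                }
                long deadline = System.currentTimeMillis() + timeoutMs;
                while (System.currentTimeMillis() < deadline) {
                  Thread.sleep(pollIntervalMs);
                  if (dfs.isFileClosed(p)) { // requires HDFS-4525 (not in hadoop 1.0/1.1)
                    return true;
                  }
                }
                return false; // caller may re-issue recoverLease after a longer wait
              }
            }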

          Varun Sharma added a comment -

          My feeling is that this will not truly work because we will get stuck.

          We call recoverLease - the lease recovery gets assigned to a primary datanode which is already dead. Now we keep calling isFileClosed(), but the file never closes since the lease recovery does not really start (unless we have something like HDFS-4721).

          Eventually, I suspect there is a timeout for how long HLog tasks can be outstanding.

          Varun

          Nicolas Liochon added a comment -

          I think that for 0.94.8 we should stick to the version that acts like before, except for the increased sleep from 1s to 4s. That's Ted's version v5 if I'm not wrong. This version lowers the probability of data loss on a false-positive region server timeout, at a very low cost to MTTR.

          For 0.95, we need to test the fixes. By test, I mean unplugging a computer to see what the MTTR is. In the tests done here, it's over 10 minutes (more tests to come tomorrow, but it's repeatable).

          This will take a little bit longer, but given the complexity and the criticality, it's our best bet imho...

          Tsz Wo Nicholas Sze added a comment -

          > We call recoverLease - lease recovery gets added to primary datanode which is already dead. Now, we keep calling isClosed() but hte file never closes since the lease recovery does not really start ...

          You are right that the client calls recoverLease on the Namenode, and the Namenode then chooses a live datanode as the primary for recovery. However, if the primary datanode is indeed dead, a second lease recovery won't be started until the lease expires. This will take a long time unless we have HDFS-4721.

          Tsz Wo Nicholas Sze added a comment -

          > ... a second lease recovery won't be started ...

          I forgot to say that the second lease recovery will be started automatically by the namenode, but it has to wait for LEASE_HARDLIMIT_PERIOD, which is one hour by default. If HBase wants to speed up the second recovery, it may wait and check whether the file is closed for some reasonable time period, say 1 minute, and then call recoverLease again.

          Ted Yu added a comment -

          Based on Nicholas Sze's and Nicolas Liochon's comments above, I will integrate patch v5 into 0.94 later today and open a new JIRA for trunk.

          Varun Sharma added a comment -

          I think "isFileClosed()" loop basically assumes that recoverLease() has started the recovery. But it is quite possible that recoverLease() has never started the recovery. So we need to be a little more careful with the interactions b/w recoverLease and isFileClosed().

          For now we can perhaps stick to v5. I think the behaviour there would be, for a cluster run with default settings:
          a) recoverLease every 4 seconds
          b) HLog split timeout expires
          c) Let the task bounce back and forth between region servers

          The only way to fix this is to configure the HDFS cluster such that lease recovery finishes within 4 seconds (e.g. dfs.socket.timeout=3000 and low enough connect timeouts). I am going to test out some of these combinations today on 0.94.

          Thanks
          Varun

          Enis Soztutar added a comment -

          For Hadoop 2, I think it makes sense to do something to this effect:

          boolean closed = false;
          while (!closed) {
            closed = recoverLease();
            while (!closed && elapsed < 1 min) {
              sleep(4 sec);
              closed = isFileClosed();
            }
          }
          

          For 0.94 and trunk running on Hadoop 1, we can do something like:

          boolean closed = false;
          int i = 0;
          while (!closed) {
            closed = recoverLease();
            if (closed)
              return;
            sleep(min(pow(2, i), 60) sec);
            i++;
          }
          
          
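          For concreteness, here is a self-contained Java sketch of the Hadoop 2 variant above - an illustration only, not the committed patch - assuming a DistributedFileSystem handle and that isFileClosed() is present on the running Hadoop (it only exists on newer 2.x builds); the method name and timeout parameters are made up for the example:

          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.hdfs.DistributedFileSystem;

          public class LeaseRecoverySketch {
            // Ask the NN to recover the lease, then poll isFileClosed() for up to
            // pollBudgetMs before asking again; interval and budget values are illustrative.
            static void waitForLeaseRecovery(DistributedFileSystem dfs, Path path,
                long pollIntervalMs, long pollBudgetMs) throws Exception {
              boolean closed = dfs.recoverLease(path);
              while (!closed) {
                long start = System.currentTimeMillis();
                while (!closed && System.currentTimeMillis() - start < pollBudgetMs) {
                  Thread.sleep(pollIntervalMs);
                  closed = dfs.isFileClosed(path);   // Hadoop 2 only
                }
                if (!closed) {
                  closed = dfs.recoverLease(path);   // trigger another recovery attempt
                }
              }
            }
          }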
          Ted Yu added a comment -

          For Hadoop 2, I agree with the plan above. I would make 1 minute configurable.

          For Hadoop 1.0 / 1.1, a lot of the HDFS optimizations w.r.t. stale DataNode handling are absent. I feel it is hard for an optimization in HBase to achieve the desired effect under a dynamic cluster config.
          So I prefer patch v5 for 0.94.

          Lars Hofhansl added a comment -

          Let's start with v5 for 0.94. We can always improve later.
          If there is other configuration needed on the HDFS side (as Varun indicates) we should document this somewhere (at the very least in the release notes for this issue).

          Ted Yu added a comment -

          Integrated to 0.94

          Thanks for the initial patch, Varun.

          Thanks for the reviews, Lars, Varun, Nicolas and Enis.

          Let's continue optimization work in HBASE-8394.

          stack added a comment -

          I think this patch is intellectual wanking without tests or experience on a cluster where the issue is replicated. I wouldn't mind but it seems like Varun even volunteered to try out the last version of patch but the patch was committed w/o waiting on his experience. Would suggest revert until we hear back about Varun's effort especially since the man was the one to open the issue and was the person who figured out root cause.

          Lars Hofhansl added a comment -

          v5 doesn't do any harm.
          I'm fine with reverting from 0.94, but only if we also revert HBASE-8354 until this is worked out.

          Hudson added a comment -

          Integrated in HBase-0.94 #964 (See https://builds.apache.org/job/HBase-0.94/964/)
          HBASE-8389 HBASE-8354 DDoSes Namenode with lease recovery requests (Varun and Ted) (Revision 1470800)

          Result = SUCCESS
          tedyu :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/util/FSHDFSUtils.java
          Eric Newton added a comment -

          Hi, accumulo developer here. I opened HBASE-7878 based on our experience following HBase's approach to log recovery.

          From our observations:

          • we have never seen the DDoS of the NN while testing on Hadoop 1.0.4 and 2.0.3-alpha
          • accumulo tries to keep the WALs in one HDFS block, too.
          • before checking the return code of recoverLease, we had occasional data loss.
            This was with tests designed to find data loss, while randomly killing servers every few minutes. These tests would run for hours before detecting data loss. We were able to attribute data loss to incomplete log recovery.

          accumulo's approach to recovery:

          • it waits a configurable period after the loss of a tablet (region) server lock (default: 10s).
            This gives the server a chance to die, and for zookeeper to propagate agreement about the lock to the other servers.
          • it calls recoverLease on the WALs until it returns true.
          • due to the discussion in this issue (HBASE-8354), I opened (ACCUMULO-1328) to make the wait interval between calls to recoverLease configurable: it was hard coded to 1 second.
          stack added a comment -

          Eric Newton Thanks for the input boss

          Varun Sharma added a comment -

          Thank a lot for chiming in, Eric.

          It's great to know that Accumulo also uses a single WAL block - so indeed, HBASE-7878 is needed for HBase.

          I may have used a strong word, DDoS, here - it does not mean the NN experiences high CPU, high load or high network. It's just that it enters a vicious cycle of block recoveries (on 2.0.0-alpha).

          Varun

          Varun Sharma added a comment -

          FYI,

          I am going to test v5 of this patch today and report back... Thanks!

          Varun Sharma added a comment -

          Okay, I did some testing with v5 and the MTTR was pretty good - 2-3 minutes - log splitting took 20-30 seconds - and basically around 1 minute of this time was to replay the edits from recovered_edits. This was, however, with the stale node patches (HDFS-3703 and HDFS-3912) applied to HDFS. I also had a tight dfs.socket.timeout=30. I mostly suspended the Datanode and the region server processes at the same time. I also ran a test where I used iptables to firewall against all traffic to the host except "ssh" traffic.

          However, the weird thing was that I also tried to reproduce the failure scenario above - that is, setting the timeout at 1 second - and I could not. I looked into the NN logs and this is what happened: lease recovery was called the 1st time and a block recovery was initiated with the dead datanode (no HDFS-4721). Lease recovery was called the 2nd time and it returned true almost every time I ran these tests.

          This is something that I did not see the last time around. The logs I attached above show that a lease recovery is called once by one SplitLogWorker, followed by 25 calls by another worker, followed by another 25, and eventually hundreds of calls the 3rd time. The 25 calls make sense since each split worker has a task-level timeout of 25 seconds and we do recoverLease every second. Also, there are 3 resubmissions, so the last worker is trying to get back the lease. I wonder if I hit a race condition which I can no longer reproduce, where one worker had the lease and did not give it up and subsequent workers just failed to recover the lease. In that case, HBASE-8354 is not the culprit, but I still prefer the more relaxed timeout in this JIRA.

          Also, I am now a little confused about lease recovery. It seems that lease recovery can be separate from block recovery. Basically, recoverLease is called the first time and we enqueue a block recovery (which is never going to happen since we try to hit the dead datanode that's not heartbeating). However, the 2nd call still returns true, which confuses me since the block is still not finalized.

          I wonder if lease recovery means anything more than flipping something at the namenode saying who holds the lease on the file. It's quite possible that the underlying block/file has not truly been recovered.

          Eric Newton
          Do you see something similar in your namenode logs when you kill both the region server and the datanode - lease recovery initiated but no real block recovery / commitBlockSynchronization messages? When we kill the region server + datanode, we basically kill the primary (first) datanode which holds the block - this is the same datanode that would be chosen for block recovery.

          Thanks
          Varun

          Eric Newton added a comment -

          I will experiment tomorrow to see what happens when I firewall a whole node.

          Our master calls recoverLease and does not initiate log sorting until it returns true. I know from our logs that it takes several calls to the NN, and sometimes as long as 30 seconds. This is from memory, so I'll double check.

          Varun Sharma added a comment -

          Thanks Eric Newton

          This is something that I am not seeing in the tests I ran yesterday - the 2nd call almost always succeeds and returns true. Also, when the master calls recoverLease, are there corresponding commitBlockSynchronization messages in the NN log?

          You may find this helpful - http://www.cyberciti.biz/tips/linux-iptables-4-block-all-incoming-traffic-but-allow-ssh.html

          Enis Soztutar added a comment -

          > Also, I am now a little confused with lease recovery. It seems that lease recovery can be separate from block recovery.

          Lease recovery is a two-step process. It does lease recovery first, where the NN assigns the lease to itself and generates a new GS, which should ensure that the client cannot do any more NN operations. Then it does block recovery with another GS, where it schedules the primary DN to recover the block. This document https://issues.apache.org/jira/secure/attachment/12445209/appendDesign3.pdf has some details on the process; not sure how current it is, though.

          > However the 2nd call still returns true which confuses me since the block is still not finalized.

          Do you have any logs that correspond to this? If so, this is clearly a bug in HDFS. If we can repro it, we can have the HDFS folks look into it.

          Ted Yu added a comment -

          Looking at the Accumulo pom.xml, I think Hadoop 1.0.4 is used.
          I think Varun is testing something close to HDFS 2.0.

          There may be differences when we compare logs from the Namenode and the Datanodes.

          @Varun:
          If you can attach NN log showing what you described above, that would be nice.

          Thanks

          Eric Newton added a comment -

          Ted Yu, accumulo is being stress-tested against 2.0.3-alpha these days; it requires building with a specific profile, but works well.

          We are presently testing the first release of accumulo to use HDFS for WALogs, so it is critical we know the behavior of recoverLease. We have not had any data loss since we began watching the return result of recoverLease; we are growing more confident that it works correctly.

          Varun Sharma added a comment -

          This is Hadoop 2.0.0-alpha, CDH 4.2 - namenode logs - this is all there is for this block.
          LOG LINE FOR BLOCK CREATION (NAMENODE)

          2013-04-24 05:40:30,282 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /hbase/.logs/ip-10-170-15-97.ec2.internal,60020,1366780717760/ip-10-170-15-97.ec2.internal%2C60020%2C1366780717760.1366782030238. BP-889095791-10.171.1.40-1366491606582 blk_-2482251885029951704_11942

          {blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.170.15.97:50010|RBW], ReplicaUnderConstruction[10.168.12.138:50010|RBW], ReplicaUnderConstruction[10.170.6.131:50010|RBW]]}

          LOG LINES FOR RECOVERY INITIATION (NAMENODE)

          2013-04-24 06:14:43,623 INFO BlockStateChange: BLOCK* blk_-2482251885029951704_11942

          {blockUCState=UNDER_RECOVERY, primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[10.170.15.97:50010|RBW], ReplicaUnderConstruction[10.168.12.138:50010|RBW], ReplicaUnderConstruction[10.170.6.131:50010|RBW]]}

          recovery started, primary=10.170.15.97:50010
          2013-04-24 06:14:43,623 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: File /hbase/.logs/ip-10-170-15-97.ec2.internal,60020,1366780717760-splitting/ip-10-170-15-97.ec2.internal%2C60020%2C1366780717760.1366782030238 has not been closed. Lease recovery is in progress. RecoveryId = 12012 for block blk_-2482251885029951704_11942

          {blockUCState=UNDER_RECOVERY, primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[10.170.15.97:50010|RBW], ReplicaUnderConstruction[10.168.12.138:50010|RBW], ReplicaUnderConstruction[10.170.6.131:50010|RBW]]}

          Note that the primary index is 0 - which is the datanode I killed. This was chosen as the primary DN for lease recovery. Obviously it will not work since the node is dead. But recoverLease returned true nevertheless for the next call. Now I am not sure if that is expected behaviour, since the real block recovery never happened.

          Varun Sharma added a comment -

          Eric Newton
          Thanks Eric for checking (I guess we should have a similar data loss checker for hbase)

          I wonder if you could look at your NN logs and see if you do get commitBlockSynchronization() log messages when the recoverLease method is called. I am trying to figure out why the block is not getting recovered while recoverLease is still returning true. These show up like this:

          2013-04-24 16:38:26,254 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(lastblock=BP-889095791-10.171.1.40-1366491606582:blk_-2482251885029951704_11942, newgenerationstamp=12012, newlength=7044280, newtargets=[10.170.15.97:50010], closeFile=true, deleteBlock=false)

          Enis Soztutar added a comment -

          > I guess we should have a similar data loss checker for hbase

          In fact we do. We have been using the goraci tool, which is a port of the Accumulo test. Then we ported that to HBase proper as well; the class name is IntegrationTestBigLinkedList. We also have other load/verify tools, namely LoadTestTool and IntegrationTestLoadAndVerify.
          We also have a tool called ChaosMonkey that does the killing of servers (see http://hbase.apache.org/book/hbase.tests.html).

          > But recoverLease returned true nevertheless for the next call. Now I am not sure if that is expected behaviour since the real block recovery never happened.

          For each call, the NN will choose the primary DN in round-robin fashion. See BlockInfoUnderConstruction#initializeBlockRecovery():

              int previous = primaryNodeIndex;
              for(int i = 1; i <= replicas.size(); i++) {
                int j = (previous + i)%replicas.size();
                if (replicas.get(j).isAlive()) {
                  primaryNodeIndex = j;
                  DatanodeDescriptor primary = replicas.get(j).getExpectedLocation(); 
                  primary.addBlockToBeRecovered(this);
                  NameNode.stateChangeLog.info("BLOCK* " + this
                    + " recovery started, primary=" + primary);
          

          Are those logs above from the first call? Do you have logs that correspond to the second or third call? Before returning true, you should see a log from:

              if(nrCompleteBlocks == nrBlocks) {
                finalizeINodeFileUnderConstruction(src, pendingFile);
                NameNode.stateChangeLog.warn("BLOCK*"
                  + " internalReleaseLease: All existing blocks are COMPLETE,"
                  + " lease removed, file closed.");
                return true;  // closed!
              }
          
          Varun Sharma added a comment -

          There is no second recovery in the logs, because the 2nd call returns true - I am following up on this in HDFS-4721.

          Ted Yu added a comment -

          Eric Newton:
          Thanks for the update about the Hadoop version that you use.

          2.0.4-alpha is being released. You may want to check it out.

          Varun Sharma added a comment -

          Alright, so it seems I have been stupid in running the recent tests. The lease recovery is correct in Hadoop. I had forgotten what the v5 patch exactly does - it reverts to the old behaviour - so I kept searching the namenode logs for multiple lease recoveries.

          HDFS timeouts (for the region server and HDFS): socket timeout = 3 seconds, socket write timeout = 5 seconds, and ipc connect retries = 0 (the connect timeout is hardcoded at 20 seconds, which is way too high).

          I am summarizing each case:
          1) After this patch,
          When we split a log, we will do the following:
          a) Call recoverLease, which will enqueue a block recovery to the dead datanode, so a noop
          b) sleep 4 seconds
          c) Break the loop and access the file irrespective of whether recovery happened
          d) Sometimes fail but eventually get through

          Note that lease recovery has not happened. If HBase finds a zero-size HLog at any of the datanodes (the size is typically zero at the namenode since the file is not closed yet), it will error out and unassign the task, and some other region server will pick up the split task. From the HBase console, I am always seeing non-zero edits being split - so we are reading data. I am not sure if Accumulo does similar checks for zero-sized WALs, but Eric Newton will know better.

          Since lease recovery has not happened, we risk data loss, but it again depends on what kind of data loss Accumulo sees - whether entire WAL(s) are lost or portions of WAL(s). If it's entire WAL(s), maybe the zero-size check in HBase saves it from data loss. But if portions of a WAL are being lost in Accumulo when the recoverLease return value is not checked, then we can have data loss after the v5 patch. Again, I will let Eric Newton speak to that.

          The good news, though, is that I am seeing pretty good MTTR in this case. It's typically 2-3 minutes, and WAL splitting accounts for maybe 30-40 seconds. But note that I am running with HDFS-3912 and HDFS-3703 and that my HDFS timeouts are configured to fail fast.

          2) Before this patch but after 8354

          We have the issue where lease recoveries pile up on the namenode faster than they can be served (one every second); the side effect is that each later recovery preempts the earlier one. Basically, with HDFS it is simply not possible to get lease recovery within 4 seconds unless we use some of the stale node patches and really tighten all the HDFS timeouts and retries. So recoveries never finish in one second, and they keep piling up and preempting earlier recoveries. Eventually we wait for 300 seconds, hbase.lease.recovery.timeout, after which we just open the file, and mostly the last recovery has succeeded by then.

          MTTR is not good in this case - at least 6 minutes for log splitting. One possibility could have been to reduce the 300 seconds to maybe 20 seconds.

          3) One can have the best of both worlds - a good MTTR and little or no data loss - by opening files only after real lease recovery has happened, to avoid data corruption. For that, one would need to tune the HDFS timeouts (the connect + socket timeouts) to be low, so that lease recoveries can happen within 5-10 seconds. I think that for such cases we should have a parameter saying whether we want to force lease recovery first - I am going to raise a JIRA to discuss that configuration. Overall, if we had an isClosed() API life would be so much easier, but a large number of Hadoop releases do not have it yet. I think this is more of a power-user configuration, but it probably makes sense to have one.

          Thanks !

          Varun Sharma added a comment -

          As I said above, there is no bug in HDFS - for a moment I thought that patch v5 only increased the timeout from 1s to 4s, but it also reverts to the old behaviour of not enforcing lease recovery so that we can reduce MTTR. If we choose to enforce lease recovery, and this timeout is significantly lower than the time it takes to recover the lease (i.e. we can't recover within 4s), our MTTR will be poor.

          Nicolas Liochon added a comment -

          > Basically with HDFS it is simply not possible to get lease recovery within 4 seconds unless we use some of the stale node patches and really tighten all the HDFS timeouts and retries.

          Stale node detection is mandatory if you care about MTTR and performance during failures. I'm more scared about setting the timeouts to very low values.

          I have the following improvements in mind:

          • have the master start the recovery and not the region server
          • in hdfs recovery, retry only once.
          • add an API in the namenode to mark a datanode as stale. This would be used by the master before doing the recoverLease.
          • when splitting, put a higher priority on files already closed
          • using the locality when reading the wal is in HBASE-6772. This should decrease the split time.

          A brittle option would be to start the split while the recovery is in progress, and synchronize on fileClosed only at the end. This may be reasonable on top of HBASE-7006.

          The v5 in 0.94 is just a way to buy time without risking regression, not a long-term solution. Points 1 to 3 are acceptable in 0.94, while points 4 & 5 are for trunk (at least at the beginning) imho.

          The tests here (on trunk) still show some random issues, but I would expect that we can make it work with an HDFS timeout of 30s.

          Varun, from your tests, with 0.94 and an HDFS that has stale mode & the first 3 points above, do you think it would work?

          Varun Sharma added a comment -

          Hi Nicholas,

          Firstly I configure the HDFS cluster in the following way:

          dfs.socket.timeout = 3sec
          dfs.socket.write.timeout = 5sec
          ipc.client.connect.timeout = 1sec
          ipc.client.connect.max.retries.on.timeouts = 2 (hence total 3 retries)

          The connect timeout is low since connecting should really be very fast unless something major is wrong. Our clusters are housed within the same AZ on Amazon EC2 and it is very rare to see these timeouts even getting hit on EC2, which is known for poor I/O performance. For the most part, I see these timeouts kick in during failures. Note that these timeouts are only used for avoiding bad datanodes and not for marking nodes as dead/stale, so I think these timeouts are okay for quick failovers - we already have high timeouts for dead node detection / the ZooKeeper session (10's of seconds).

          stale node timeout = 20 seconds
          dead node timeout = 10 minutes
          ZooKeeper session timeout = 30 seconds

          HDFS is Hadoop 2.0 with HDFS-3703, HDFS-3912 and HDFS-4721. The approach is the following:

          a) A node is failed artificially using
          1) Use iptables to only allow ssh traffic and drop all traffic
          2) Suspending the processes

          b) Even though we configure stale detection to be faster than HBase detection, let's assume that does not play out. The node is not marked stale.

          c) Lease recovery attempt # 1
          i) We choose a good primary node for recovery - since it's likely that the bad node has the worst possible heartbeat (HDFS-4721)
          ii) But we point it to recover from all 3 nodes since we are considering the worst case where no node is marked stale
          iii) The primary tries to reconcile the block with all 3 nodes and hits either
          a) dfs.socket.timeout = 3 seconds - if process is suspended
          b) ipc.connect.timeout X ipc.connect.retries which is 3 * 1 second + 3 * 1 second sleep = 6 seconds - if we firewall the host using iptables

          d) If we use a value of 4 seconds, the first recovery attempt does not finish in time and we initiate lease recovery #2
          i) Either a rinse and repeat of c) happens
          ii) Or the node is now stale and the block is instantly recovered from the remaining two replicas

          I think we could either adjust the 4-second timeout to, say, 8 seconds and mostly get the first attempt to succeed, or otherwise just wait for stale node detection and then have a fairly quick block recovery due to HDFS-4721.

          I will try to test these values tomorrow, by rebooting some nodes...

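          (For reference, a minimal sketch of how the client-side values above could be applied to the Configuration handed to the DFS client; the property keys are the ones listed in this comment and may differ across Hadoop versions, so treat them as assumptions:)

          import org.apache.hadoop.conf.Configuration;

          public class TightHdfsClientTimeouts {
            // Illustrative only: mirrors the timeout values described above.
            public static void tighten(Configuration conf) {
              conf.setInt("dfs.socket.timeout", 3000);                      // DN read socket timeout, ms
              conf.setInt("dfs.socket.write.timeout", 5000);                // DN write socket timeout, ms
              conf.setInt("ipc.client.connect.timeout", 1000);              // per-attempt connect timeout, ms
              conf.setInt("ipc.client.connect.max.retries.on.timeouts", 2); // 3 connect attempts in total
            }
          }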
          Nicolas Liochon added a comment -

          > b) Even though we configure stale detection to be faster than HBase detection, let's assume that does not play out. The node is not marked stale.

          Yep. If we have the API to mark a node as stale, it lowers a lot the probability of trying to go to a stale node while recovering, and if we limit the number of tries to one, we should spend less than 1 minute trying the wrong node.

          Varun Sharma added a comment -

          Okay. I am just not 100 % sure if we want to mark the datanode as stale if the region server crashes alone...

          Nicolas Liochon added a comment -

          > I am just not 100% sure if we want to mark the datanode as stale if the region server crashes alone...

          In the HDFS jira, I proposed to put a duration on it: that limits the effect if there are false positives.
          We could as well do a check before marking the datanode as stale. It's not necessary short term imho, but we can add a hook in case someone wants to do it.

          Eric Newton added a comment -

          I can confirm long recovery times.

          I stopped the data node and tablet server with SIGSTOP. I'm seeing messages like this from the NN:

          2013-04-25 17:55:39,923 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 6093, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization from 10.0.0.7:49820: error: java.io.IOException: The recovery id 2494 does not match current recovery id 2495 for block BP-189257095-10.0.0.3-1366907958658:blk_8428898362502069151_2174
          java.io.IOException: The recovery id 2494 does not match current recovery id 2495 for block BP-189257095-10.0.0.3-1366907958658:blk_8428898362502069151_2174
          

          Even when I use a long delay between calls to recoverLease (60s), it does not recover.

          If I kill the servers, recovery happens quickly.

          Unfortunately we see this kind of "unresponsive, but not dead" servers in practice.

          I tried these configurations, and it did not improve:

          dfs.socket.timeout = 3sec
          dfs.socket.write.timeout = 5sec
          ipc.client.connect.timeout = 1sec
          ipc.client.connect.max.retries.on.timeouts = 2 
          
          Varun Sharma added a comment -

          How long is the recovery time? The 1st recovery will always be moot unless you have HDFS-4721.

          Can you grep the namenode logs for "8428898362502069151" and paste them here ?

          Varun Sharma added a comment -

          One more question - did you apply these timeouts to both your HDFS datanodes and your Accumulo DFS clients (and restart)? If yes, I expect the second recovery to succeed, hence you should recover in 2 attempts = 120 seconds.

          Eric Newton added a comment -

          I double-checked the configuration settings... and there were two typos. Now files recover as expected.

          Varun Sharma added a comment -

          Okay, I am done with testing for HBase using the above configuration - tight timeouts and stale node patches. I am using patch v5 and HBASE-8434 on top, to force lease recovery and not skip it.

          The testing is done by doing "kill -STOP <server_process>; kill -STOP <datanode_process>". I am forcing lease recovery, so I am applying HBASE-8434 on top, which basically means calling recoverLease every 5 seconds until it returns true. I have not found a single case where recovery takes more than 30-40 seconds. The HDFS runs with the HDFS-3703, HDFS-3912 and HDFS-4721 patches. So at some point the recovery succeeds within 1 second after a node is marked stale.
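
          For illustration, a minimal sketch of that loop - call recoverLease, sleep 5 seconds, repeat until it returns true. DistributedFileSystem.recoverLease(Path) is the actual HDFS client call; the class and method names are hypothetical and this is not the HBASE-8434 patch itself:

          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.hdfs.DistributedFileSystem;

          public final class ForceLeaseRecovery {
            /** Blocks until the namenode reports the lease recovered, i.e. recoverLease returns true. */
            public static void recover(DistributedFileSystem dfs, Path walFile) throws Exception {
              boolean recovered = dfs.recoverLease(walFile);
              while (!recovered) {
                Thread.sleep(5000);                    // 5 seconds between attempts
                recovered = dfs.recoverLease(walFile);
              }
            }
          }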

          So, I am able to consistently get log splitting finished within the 1st minute and either by the 2nd or 3rd minute all regions are back online. I have tried this a sufficient number of times to convince me that it works.

          Varun

          Varun Sharma added a comment -

          Eric Newton

          Files recover as expected.

          Can you elaborate - how many recovery attempts before success, and how long between retries?

          Varun

          stack added a comment -

          Reading over this nice, fat, info-dense issue, I am trying to figure what we need to add to trunk right now.

          Sounds like checking the recoverFileLease return gained us little in the end (though Varun, you think we want to keep going till it's true, though v5 here skips out on it). The valuable finding hereabouts is the need for a pause before going ahead with the file open, it seems. Trunk does not have this pause. I need to add a version of v5 to trunk? (Holding our breath until an API not yet generally available, isFileClosed hbase-8394, shows up is not an option for now; nor is an expectation that all will just upgrade to an hdfs that has this API either.)

          The hbase-7878 backport is now elided since we have added back the old behavior w/ the patch applied here, excepting the pause of an arbitrary enough 4 seconds.

          The patch applied here does not loop on recoverLease after the 4 seconds expire. It breaks. In trunk we loop. We should break too (...and let it fail if 0 length, and then let the next split task do a new recoverLease call?)

          On the 4 seconds, it seems it should rather be the dfs.socket.timeout that hdfs is using – plus a second or so – rather than "4 seconds", if I follow Varun's reasoning above properly, and just remove the new config 'hbase.lease.recovery.retry.interval' (we have enough configs already)?
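
          A sketch of that calculation, assuming the HBase client is handed the same dfs.socket.timeout value that HDFS uses (the 60-second default and the one-second margin are illustrative, not committed values):

          import org.apache.hadoop.conf.Configuration;

          public final class LeaseRecoveryPause {
            /** Pause before reopening the WAL: the DFS socket timeout plus "a second or so". */
            static long pauseMillis(Configuration conf) {
              long dfsSocketTimeout = conf.getLong("dfs.socket.timeout", 60000L);
              return dfsSocketTimeout + 1000L;
            }
          }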

          Sounds like we are depending on WAL sizes being < HDFS block sizes. This will not always be the case; we could go into a second block easily if a big edit comes in on the tail of the first block; and then there may be dataloss (TBD) because we have a file size (so we think the file recovered?)

          Sounds also like we are relying on the file size being zero as a marker that the file is not yet closed (I suppose that is ok because an empty WAL will be > 0 length IIRC; we should doc our dependency though).

          Varun, I like your low timeouts. Would you suggest we adjust hbase default timeouts down and recommend folks change their hdfs defaults if they want better MTTR? If you had a blog post on your nice work done in here, I could at least point the refguide at it for those interested in improved MTTR (smile).

          Nicolas Liochon added a comment -

          Varun, I +1 Stack: the timeout settings you mentioned are quite impressive!
          Thanks a lot for all this work.

          Here is my understanding, please correct me where I'm wrong.

          I don't think that single vs. multiple blocks is an issue, even if it's better to have a single block (increased parallelism).

          HBase has a dataloss risk: we need to wait for the end of recoverFileLease before reading,
          => either by polling the NN and calling recoverFileLease multiple times,
          => or by calling isFileClosed (HDFS-4525) (and polling as well) where it's available.

          I'm not sure that we can poll recoverFileLease every second. When I try, I get the same logs as Eric: "java.io.IOException: The recovery id 2494 does not match current recovery id 2495 for block", and the state of the namenode seems strange.

          In critical scenarios, the recoverFileLease won't happen at all. The probability is greatly decreased by HDFS-4721, but it's not zero.

          In critical scenarios, the recoverFileLease will start, but will be stuck on bad datanodes. The probability is greatly decreased by HDFS-4721 and HDFS-4754, but it's not zero. Here, we need to limit the number of retries in HDFS to one, whatever the global setting, to be on the safe side (no hdfs jira for this).

          I see a possible common implementation (trunk / hbase 0.94)

          • if HDFS-4754 is available, call markAsStale to be sure this datanode won't be used.
          • call recoverFileLease a first time.
          • if HDFS-4525 is available, call isFileClosed every second to detect that the recovery is done.
          • every 60s, call recoverFileLease again (either because isFileClosed is missing, or because we went into one of the bad scenarios above); see the sketch below.
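
          A rough sketch of that polling scheme; the reflection lookup is just one way to cope with HDFS clients that predate HDFS-4525, the 1s/60s intervals come from the list above, and everything else (names, structure) is illustrative:

          import java.lang.reflect.Method;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.hdfs.DistributedFileSystem;

          public final class RecoverLeasePoller {
            public static void waitUntilClosed(DistributedFileSystem dfs, Path file) throws Exception {
              // (a markAsStale call, where HDFS-4754 exists, would go before this)
              Method isFileClosed = findIsFileClosed(dfs);     // null if HDFS-4525 is absent
              boolean done = dfs.recoverLease(file);           // first recovery attempt
              long lastRecoverCall = System.currentTimeMillis();
              while (!done) {
                Thread.sleep(1000);                            // poll every second
                if (isFileClosed != null) {
                  done = (Boolean) isFileClosed.invoke(dfs, file);
                }
                // Re-issue recoverLease every 60s: either isFileClosed is missing,
                // or the earlier recovery got stuck on a bad datanode.
                if (!done && System.currentTimeMillis() - lastRecoverCall >= 60000L) {
                  done = dfs.recoverLease(file);
                  lastRecoverCall = System.currentTimeMillis();
                }
              }
            }

            private static Method findIsFileClosed(DistributedFileSystem dfs) {
              try {
                return dfs.getClass().getMethod("isFileClosed", Path.class);
              } catch (NoSuchMethodException e) {
                return null;                                   // older HDFS client
              }
            }
          }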

          This would mean: no dataloss and a MTTR of:

          • less than a minute if we have stale mode + HDFS-4721 + HDFS-4754 + HDFS-4525 + no retry in HDFS recoverLease or Varun's settings.
          • around 12 minutes if we have none of the above. But that's what we have already without the stale mode imho.
          • in the middle if we have a subset of the above patches and config.

          As HDFS-4721 seems validated by the HDFS dev team, I think that my only question is: can we poll recoverFileLease very frequently if we don't have isFileClosed?

          As a side note, tests more or less similar to yours with HBase trunk and HDFS branch-2 trunk (without your settings but with a hack to skip the dead nodes) bring similar results.

          Eric Newton added a comment -

          Can you elaborate - how many recovery attempts before success, and how long between retries?

          After the tablet server loses its lock in zookeeper, the master waits 10s and calls recoverLease which returns false. After 5s, recoverLease is retried and succeeds. These are the default values for the timeouts.

          Varun Sharma added a comment -

          Stack
          I can do a small write up that folks can refer to.

          Nicolas Liochon
          One point regarding the low settings though: they are good for fast-MTTR requirements such as online clusters, but they do not work well if you pound a small cluster with mapreduce jobs. The write timeouts start kicking in on the datanodes - we saw this on a small cluster. So it has to be taken with a pinch of salt.

          I think 4 seconds might be too tight, because we have the following sequence:
          1) recoverLease called
          2) The primary datanode heartbeats (this can take 3 seconds in the worst case)
          3) There are multiple timeouts during recovery at the primary datanode:
          a) dfs.socket.timeout kicks in when we suspend the processes using "kill -STOP" - there is only 1 retry
          b) ipc.client.connect.timeout is the troublemaker - on old hadoop versions it is hardcoded at 20 seconds. On some versions, the # of retries is hardcoded at 45. This can be triggered by firewalling a host using iptables to drop all incoming/outgoing TCP packets. Another issue here is that between the timeouts there is a 1 second hardcoded sleep - I just fixed it in HADOOP-9503. If we make sure that all the dfs.socket.timeout and IPC client settings are the same in hbase-site.xml and hdfs-site.xml. Then, we can

          The retry rate should be no faster than 3a and 3b - or lease recoveries will accumulate for 900 seconds in trunk. To get around this problem, we would want to make sure that hbase-site.xml has the same settings as hdfs-site.xml, and we calculate the recovery interval from those settings. Otherwise, we can leave a release note saying that this number should be max(dfs.socket.timeout, ipc.client.connect.max.retries.on.timeouts * ipc.client.connect.timeout, ipc.client.connect.max.retries).
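
          For illustration, the time-valued part of that rule computed from a Configuration (the defaults here are placeholders, and the ipc.client.connect.max.retries term of the formula is a retry count rather than a duration, so this sketch leaves it out):

          import org.apache.hadoop.conf.Configuration;

          public final class RecoverLeaseRetryInterval {
            /** Retry no faster than the slowest underlying HDFS/IPC timeout, or recoveries pile up on the NN. */
            static long retryIntervalMillis(Configuration conf) {
              long dfsSocketTimeout = conf.getLong("dfs.socket.timeout", 60000L);
              long connectTimeout   = conf.getLong("ipc.client.connect.timeout", 20000L);
              long connectRetries   = conf.getLong("ipc.client.connect.max.retries.on.timeouts", 45L);
              return Math.max(dfsSocketTimeout, connectRetries * connectTimeout);
            }
          }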

          The advantage of having HDFS-4721 is that at some point the datanode will be recognized as stale - maybe a little later than hdfs recovery. Once that happens, recoveries typically occur within 2 seconds.

          Ted Yu added a comment -

          @Varun:

          Then, we can

          Can you complete the above sentence?

          Varun Sharma added a comment -

          Sorry about that...

          If we make sure that all the dfs.socket.timeout and IPC client settings are the same in hbase-site.xml and hdfs-site.xml, then we can "do a custom calculation of the recoverLease retry interval inside HBase". But basically HBase needs to know in some way how the timeouts are set up underneath.

          Thanks
          Varun

          Ted Yu added a comment -

          If we make sure that all the dfs.socket.timeout and ipc client settings are the same in hbase-site.xml and hdfs-site.xml.

          Should we add a check for the above at cluster startup? If a discrepancy is found, we can log a warning message.
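
          A minimal sketch of such a check; the class name, key list and log wording are illustrative, not an existing HBase hook:

          import org.apache.commons.logging.Log;
          import org.apache.commons.logging.LogFactory;
          import org.apache.hadoop.conf.Configuration;

          public final class TimeoutConfigSanityCheck {
            private static final Log LOG = LogFactory.getLog(TimeoutConfigSanityCheck.class);
            private static final String[] KEYS = {
              "dfs.socket.timeout",
              "ipc.client.connect.timeout",
              "ipc.client.connect.max.retries.on.timeouts"
            };

            /** Warns when the HBase-side and HDFS-side values of a timeout key differ. */
            public static void warnOnMismatch(Configuration hbaseConf, Configuration hdfsConf) {
              for (String key : KEYS) {
                String hbaseValue = hbaseConf.get(key);
                String hdfsValue = hdfsConf.get(key);
                if (hbaseValue != null && hdfsValue != null && !hbaseValue.equals(hdfsValue)) {
                  LOG.warn("Config " + key + " differs between hbase-site.xml (" + hbaseValue
                      + ") and hdfs-site.xml (" + hdfsValue + ")");
                }
              }
            }
          }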

          stack added a comment -

          varun Thanks. I made HBASE-8449 for trunk patch (and to fix what is applied here – the 4s in particular).

          Jimmy Xiang added a comment -

          Varun Sharma, when you played with the patch for this jira, did your HBase have HBASE-8314 and HBASE-8321? I think HBASE-8314 may give you extra time for the file to recover.

          Varun Sharma added a comment -

          No, it does not. We primarily fix it in HDFS since that's where the core of the problem is (slow recovery).

          Ted Yu added a comment -

          Can this issue be resolved?

          Lars Hofhansl added a comment -

          Closing. Please re-open if I'm confused.

          Hudson added a comment -

          Integrated in HBase-0.94-security #141 (See https://builds.apache.org/job/HBase-0.94-security/141/)
          HBASE-8389 HBASE-8354 DDoSes Namenode with lease recovery requests (Varun and Ted) (Revision 1470800)

          Result = FAILURE
          tedyu :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/util/FSHDFSUtils.java
          Hudson added a comment -

          Integrated in HBase-TRUNK #4142 (See https://builds.apache.org/job/HBase-TRUNK/4142/)
          HBASE-8449 Refactor recoverLease retries and pauses informed by findings over in hbase-8389 (Revision 1486121)
          HBASE-8449 Refactor recoverLease retries and pauses informed by findings over in hbase-8389 (Revision 1486108)

          Result = SUCCESS

          Hudson added a comment -

          Integrated in hbase-0.95 #214 (See https://builds.apache.org/job/hbase-0.95/214/)
          HBASE-8449 Refactor recoverLease retries and pauses informed by findings over in hbase-8389 (Revision 1486107)

          Result = SUCCESS

          Hudson added a comment -

          Integrated in hbase-0.95-on-hadoop2 #111 (See https://builds.apache.org/job/hbase-0.95-on-hadoop2/111/)
          HBASE-8449 Refactor recoverLease retries and pauses informed by findings over in hbase-8389 (Revision 1486107)

          Result = FAILURE

          Hudson added a comment -

          Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #542 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/542/)
          HBASE-8449 Refactor recoverLease retries and pauses informed by findings over in hbase-8389 (Revision 1486121)
          HBASE-8449 Refactor recoverLease retries and pauses informed by findings over in hbase-8389 (Revision 1486108)

          Result = FAILURE

          Hudson added a comment -

          Integrated in HBase-TRUNK #4144 (See https://builds.apache.org/job/HBase-TRUNK/4144/)
          Add section on mttr; cite nicolas/devaraj blog and point at hbase-8389 varun comments (Revision 1486358)

          Result = SUCCESS

          Hudson added a comment -

          Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #544 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/544/)
          Add section on mttr; cite nicolas/devaraj blog and point at hbase-8389 varun comments (Revision 1486358)

          Result = FAILURE


            People

            • Assignee: Varun Sharma
            • Reporter: Varun Sharma
            • Votes: 0
            • Watchers: 17
