Hadoop HDFS

Speed up lease/block recovery when DN fails and a block goes into recovery

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.3-alpha
    • Fix Version/s: 2.1.0-beta
    • Component/s: namenode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      This was observed while doing HBase WAL recovery. HBase uses append to write to its write-ahead log, so initially the pipeline is set up as

      DN1 --> DN2 --> DN3

      This WAL needs to be read when DN1 fails, since DN1 also hosts the HBase region server that owns the WAL.

      HBase first recovers the lease on the WAL file. During recovery, we choose DN1 as the primary DN to do the recovery even though DN1 has failed and is not heartbeating any more.

      Avoiding the stale DN1 would speed up recovery and reduce HBase MTTR. There are two options:
      a) Ride on HDFS-3703: if stale node detection is turned on, do not choose stale datanodes (typically no heartbeat for 20-30 seconds) as the primary DN(s)
      b) Sort the replicas in order of last heartbeat and always pick the ones which gave the most recent heartbeat (a sketch of this option follows the description)

      Going to the dead datanode increases lease and block recovery time, since the block goes into UNDER_RECOVERY state even though no one is actively recovering it. Please let me know if this makes sense and, if yes, whether we should move forward with a) or b).

      Thanks
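
      To make option (b) concrete, the selection it describes amounts to something like the sketch below. This is not actual HDFS code; ReplicaLocation and its lastHeartbeatMillis field are hypothetical stand-ins for the replica bookkeeping kept in BlockInfoUnderConstruction.

        import java.util.List;

        // Hypothetical stand-in for the per-replica bookkeeping the NameNode keeps;
        // the field names are illustrative only.
        class ReplicaLocation {
          final String datanode;          // e.g. "10.170.15.97:50010"
          final long lastHeartbeatMillis; // time of the last heartbeat from this datanode
          ReplicaLocation(String datanode, long lastHeartbeatMillis) {
            this.datanode = datanode;
            this.lastHeartbeatMillis = lastHeartbeatMillis;
          }
        }

        class PrimarySelection {
          // Option (b): pick the replica whose datanode heartbeated most recently.
          static ReplicaLocation choosePrimary(List<ReplicaLocation> replicas) {
            ReplicaLocation best = null;
            for (ReplicaLocation r : replicas) {
              if (best == null || r.lastHeartbeatMillis > best.lastHeartbeatMillis) {
                best = r;
              }
            }
            return best; // null only if the replica list is empty
          }
        }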

      1. 4721-trunk-v4.patch
        23 kB
        Varun Sharma
      2. 4721-branch2.patch
        23 kB
        Varun Sharma
      3. 4721-trunk-v3.patch
        22 kB
        Varun Sharma
      4. 4721-trunk-v2.patch
        20 kB
        Varun Sharma
      5. 4721-trunk.patch
        20 kB
        Varun Sharma
      6. 4721-v8.patch
        21 kB
        Varun Sharma
      7. 4721-v7.patch
        21 kB
        Varun Sharma
      8. 4721-v6.patch
        21 kB
        Varun Sharma
      9. 4721-v5.patch
        18 kB
        Varun Sharma
      10. 4721-v4.patch
        6 kB
        Varun Sharma
      11. 4721-v3.patch
        6 kB
        Varun Sharma
      12. 4721-v2.patch
        11 kB
        Varun Sharma


          Activity

          Varun Sharma added a comment -

          As a 2nd action item, it would also be nice to have the ability to skip "stale nodes" for reconciliation at the Primary DN. Basically, we have the following happen currently:

          1) 1st recoverLease call from HBase - bound to fail since it picks Bad DN as primary
          2) 2nd recoverLease call from HBase - picks the correct DN as primary. At the primary DN, we still try to reconcile blocks against the stale/bad DN, causing the recovery to take as long as dfs.socket.timeout (default 60 seconds)

          If we avoid picking stale nodes (nodes whose heartbeat has been missing for, say, 20-30 seconds) as the primary and also skip them during the reconciliation phase, lease recovery will be a lot faster...
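
          For context, the client-side loop that drives these recoverLease calls is roughly the following sketch. It is not HBase's actual WAL-splitting code, the timeout and sleep values are illustrative, and it assumes fs.defaultFS points at the HDFS cluster; it relies only on DistributedFileSystem.recoverLease(Path), which returns true once the file has been closed.

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;
            import org.apache.hadoop.hdfs.DistributedFileSystem;

            class LeaseRecoveryClient {
              // Keep calling recoverLease() until the NameNode reports the file closed.
              static boolean recoverWal(Configuration conf, Path wal, long timeoutMs)
                  throws Exception {
                DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
                long deadline = System.currentTimeMillis() + timeoutMs;
                while (System.currentTimeMillis() < deadline) {
                  if (dfs.recoverLease(wal)) {
                    return true; // lease recovered and file closed
                  }
                  Thread.sleep(1000); // each retry may select a different primary DN
                }
                return false;
              }
            }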

          Varun Sharma added a comment -

          I attached a rough patch which
          a) Avoids a stale node from being chosen as the primary datanode to do the recovery
          b) Skips over the stale nodes as the recovery locations when passing them to the primary datanode

          Earlier, recovery took as long as dfs.socket.timeout, but now it takes roughly 1-2 seconds (which is basically the heartbeat interval). Here are the NN logs from a test where we suspend an HBase region server and the HDFS datanode. The block is finalized within 1 second. The patch is rough and I am looking for comments.

          2013-04-21 23:31:40,036 INFO BlockStateChange: BLOCK* blk_1083189771170117282_5999

          {blockUCState=UNDER_RECOVERY, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.170.15.97:50010|RBW], ReplicaUnderConstruction[10.170.6.131:50010|RBW], ReplicaUnderConstruction[10.157.42.32:50010|RBW]]}

          skipping stale node for primary, node=10.170.15.97:50010
          2013-04-21 23:31:40,036 INFO BlockStateChange: BLOCK* blk_1083189771170117282_5999

          {blockUCState=UNDER_RECOVERY, primaryNodeIndex=1, replicas=[ReplicaUnderConstruction[10.170.15.97:50010|RBW], ReplicaUnderConstruction[10.170.6.131:50010|RBW], ReplicaUnderConstruction[10.157.42.32:50010|RBW]]}

          recovery started, primary=10.170.6.131:50010
          2013-04-21 23:31:40,036 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: File /hbase/.logs/ip-10-170-15-97.ec2.internal,60020,1366586774505-splitting/ip-10-170-15-97.ec2.internal%2C60020%2C1366586774505.1366586775415 has not been closed. Lease recovery is in progress. RecoveryId = 6148 for block blk_1083189771170117282_5999

          {blockUCState=UNDER_RECOVERY, primaryNodeIndex=1, replicas=[ReplicaUnderConstruction[10.170.15.97:50010|RBW], ReplicaUnderConstruction[10.170.6.131:50010|RBW], ReplicaUnderConstruction[10.157.42.32:50010|RBW]]}

          2013-04-21 23:31:41,280 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.170.6.131:50010 is added to blk_1083189771170117282_5999

          {blockUCState=UNDER_RECOVERY, primaryNodeIndex=1, replicas=[ReplicaUnderConstruction[10.170.15.97:50010|RBW], ReplicaUnderConstruction[10.170.6.131:50010|RBW], ReplicaUnderConstruction[10.157.42.32:50010|RBW]]}

          size 0
          2013-04-21 23:31:41,282 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.157.42.32:50010 is added to blk_1083189771170117282_5999

          {blockUCState=UNDER_RECOVERY, primaryNodeIndex=1, replicas=[ReplicaUnderConstruction[10.170.15.97:50010|RBW], ReplicaUnderConstruction[10.170.6.131:50010|RBW], ReplicaUnderConstruction[10.157.42.32:50010|RBW]]}

          size 0
          2013-04-21 23:31:41,282 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(lastblock=BP-889095791-10.171.1.40-1366491606582:blk_1083189771170117282_5999, newgenerationstamp=6148, newlength=51174873, newtargets=[10.170.6.131:50010, 10.157.42.32:50010], closeFile=true, deleteBlock=false)
          2013-04-21 23:31:41,290 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(newblock=BP-889095791-10.171.1.40-1366491606582:blk_1083189771170117282_5999, file=/hbase/.logs/ip-10-170-15-97.ec2.internal,60020,1366586774505-splitting/ip-10-170-15-97.ec2.internal%2C60020%2C1366586774505.1366586775415, newgenerationstamp=6148, newlength=51174873, newtargets=[10.170.6.131:50010, 10.157.42.32:50010]) successful

          Ted Yu added a comment -
          +  private volatile boolean avoidStaleNodesForRecovery = true;
          

          The above flag is only assigned once. Did you intend to introduce a config param to enable this feature?

          +  private volatile long staleInterval = 20000;
          

          DatanodeManager has a field, staleInterval. Should the above member be aligned with the field of DatanodeManager?
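
          If the flag does become configurable, the wiring would presumably look something like this sketch. The boolean key name is a hypothetical placeholder (a real patch would add a constant to DFSConfigKeys), while dfs.namenode.stale.datanode.interval is the existing stale-interval key that could be shared with DatanodeManager.

            import org.apache.hadoop.conf.Configuration;

            class RecoveryConfig {
              final boolean avoidStaleNodesForRecovery;
              final long staleIntervalMs;

              RecoveryConfig(Configuration conf) {
                // Hypothetical key name for the new flag.
                this.avoidStaleNodesForRecovery =
                    conf.getBoolean("dfs.namenode.avoid.stale.datanode.for.recovery", true);
                // Reuse the existing stale-interval setting instead of a hard-coded 20000.
                this.staleIntervalMs =
                    conf.getLong("dfs.namenode.stale.datanode.interval", 30000L);
              }
            }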

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12579764/4721-hadoop2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4282//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4282//console

          This message is automatically generated.

          Varun Sharma added a comment -

          Ted Yu

          Thanks Ted!

          I plan to have a config param to avoid stale nodes for recovery, and yes, the stale interval parameter could be shared with DatanodeManager.

          This was a rough patch. I am a little unsure what to do if there are too few non-stale nodes to recover from; we handle the 0 case, but what should we do if there is only 1 active node? I will add tests and make these changes once someone from the HDFS community can take a look...

          Ted Yu added a comment -

          In FSNamesystem#internalReleaseLease():

              case UNDER_CONSTRUCTION:
              case UNDER_RECOVERY:
                final BlockInfoUnderConstruction uc = (BlockInfoUnderConstruction)lastBlock;
                // setup the last block locations from the blockManager if not known
                if (uc.getNumExpectedLocations() == 0) {
                  uc.setExpectedLocations(blockManager.getNodes(lastBlock));
                }
                // start recovery of the last block for this file
                long blockRecoveryId = nextGenerationStamp();
                lease = reassignLease(lease, src, recoveryLeaseHolder, pendingFile);
          

          Can we distinguish UNDER_RECOVERY from UNDER_CONSTRUCTION so that the problem described by HBASE-8389 can be avoided?

          Nicolas Liochon added a comment -

          b) We sort the replicas in order of last heart beat and always pick the ones which gave the most recent heart beat

          I like this one. It will save us when the HBase recovery starts before the datanode is marked stale.

          Varun Sharma added a comment -

          Nicholas,

          Good point. I will create a patch which has that.

          Ted,

          As far as I know, the last WAL block is in UNDER_CONSTRUCTION state when we call internalReleaseLease. When we call internalReleaseLease, it goes into UNDER_RECOVERY state and the namenode enqueues it for recovery at the primary datanode. The namenode actually communicates this enqueued recovery block to the primary datanode during a heartbeat. Now let's say the primary never heartbeated or was lost. We ideally want to retry recoverLease in that case if the lease has not been recovered. I guess the best way around it would be to have an API for checking whether the lease has been recovered or not. Right now, we have an API to do that, but it also unconditionally enqueues the block for recovery.

          Ted Yu added a comment -

          I created HDFS-4724 for the new API that allows the client to query recovery progress.

          Nicolas Liochon added a comment -

          The algo as I understand it is:
          • The namenode sends, with the heartbeat, the request to start the recovery to one datanode. The recovery is finished when the file is no longer under construction.
          • The chosen datanode sequentially calls all the other datanodes in the pipeline, including itself, to synchronize on the block size.
          • The chosen datanode then updates the namenode, and the file is no longer under construction.

          The issue is: if one of the DNs is dead, we will have to wait for a few socket timeouts or more, as we will try to contact it.
          In this JIRA, it's fixed by skipping the stale datanode. But:

          • if the server is only stale, it won't participate in the recovery (not sure of the impact; if it's acceptable, it's great).
          • if the server is dead, we're in for a wait of at least 30s.

          Would it be possible to consider the file as no longer under construction as soon as the chosen datanode has updated its own replica?
          Then we would no longer depend on the others: with one datanode we would be done.
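
          A rough sketch of that reconciliation step on the chosen (primary) datanode is below. ReplicaEndpoint and its method are placeholders for the real inter-datanode RPC, and taking the minimum length is a simplification of the actual recovery semantics; the point is where the wait on a dead datanode happens.

            import java.util.ArrayList;
            import java.util.List;

            // Placeholder for the inter-datanode call used during block recovery.
            interface ReplicaEndpoint {
              String name();
              long getReplicaLength() throws Exception; // may block if the DN is dead
            }

            class PrimaryDnRecovery {
              // Reconcile replica lengths, skipping endpoints that cannot be reached.
              static long syncBlockLength(List<ReplicaEndpoint> pipeline) {
                List<Long> lengths = new ArrayList<Long>();
                for (ReplicaEndpoint dn : pipeline) {
                  try {
                    lengths.add(dn.getReplicaLength());
                  } catch (Exception e) {
                    // A dead DN is only discovered here, after a socket timeout;
                    // this is the wait the JIRA is trying to avoid up front.
                    System.out.println("skipping unreachable replica on " + dn.name());
                  }
                }
                if (lengths.isEmpty()) {
                  return -1; // nothing to recover from
                }
                long newLength = Long.MAX_VALUE;
                for (long l : lengths) {
                  newLength = Math.min(newLength, l);
                }
                // The real code would now report the agreed length back to the NameNode
                // via commitBlockSynchronization.
                return newLength;
              }
            }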

          Varun Sharma added a comment -

          Attached v2 which does the following:

          1) Introduces a new config variable
          dfs.namenode.stale.datanode.interval

          When true we do the following:
          a) Choose the most recently heartbeating datanode to do the recovery. For subsequent recoveries, choose a primary datanode which is not the same as the previous one but is still a recently heartbeating datanode. Another option here would be to simply avoid stale datanodes as the primary datanode. Otherwise the first recoverLease is pretty much a no-op.
          b) Exclude stale datanodes from the block recovery.

          Please provide any feedback...

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12579904/4721-v2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 javac. The applied patch generated 1367 javac compiler warnings (more than the trunk's current 1366 warnings).

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4285//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/4285//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
          Javac warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/4285//artifact/trunk/patchprocess/diffJavacWarnings.txt
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4285//console

          This message is automatically generated.

          Tsz Wo Nicholas Sze added a comment -

          I think we don't need the new conf key DFS_NAMENODE_AVOID_STALE_DATANODE_FOR_RECOVERY since the stale datanodes should be avoided in any case.

          I also think that we could simplify the patch: currently, BlockInfoUnderConstruction.initializeBlockRecovery chooses the first live datanode as primary. How about changing it to choose the least stale datanode, i.e. the one with the latest last heartbeat to namenode?

          Varun Sharma added a comment -

          Hi Nicholas,

          Thanks for taking a look...

          In my v2 patch, I am sorting the datanodes in BlockInfoUnderConstruction.initializeBlockRecovery() by lastUpdate and then picking the 0-th element from the sorted list. I actually want to remember which node was tried the first time, because if a subsequent block recovery is called, we don't want the same DN to do the recovery again. So, basically, choose the node which gave the latest heartbeat and was not the node which was issued the previous recovery.

          So, you think we should unconditionally remove stale nodes (currently a 30-second stale interval) - I can surely attach a patch which does that.

          Tsz Wo Nicholas Sze added a comment -

          > So, you think we should unconditionally remove stale nodes ...

          No, don't remove it. Just try the datanode with the latest heartbeat timestamp without repeating the same datanode. Instead of storing the index, we need some trick to remember the selected datanodes such as setting the i-th bit when the i-th datanode is selected.
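
          The bit-set idea could look roughly like the sketch below; the types are placeholders, and the real state would live in BlockInfoUnderConstruction rather than a standalone class.

            // Remember which datanodes have already been tried as primary so the
            // freshest untried one is chosen next. Assumes at least one replica
            // and at most 64 replicas (one bit per datanode).
            class PrimaryAttemptTracker {
              private long triedBits = 0;

              // lastHeartbeat[i] is the last heartbeat time of datanode i.
              int chooseNext(long[] lastHeartbeat) {
                if (Long.bitCount(triedBits) >= lastHeartbeat.length) {
                  triedBits = 0; // every node has been tried once; start a new round
                }
                int chosen = -1;
                for (int i = 0; i < lastHeartbeat.length; i++) {
                  if ((triedBits & (1L << i)) != 0) {
                    continue; // already tried in this round
                  }
                  if (chosen < 0 || lastHeartbeat[i] > lastHeartbeat[chosen]) {
                    chosen = i;
                  }
                }
                triedBits |= (1L << chosen);
                return chosen;
              }
            }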

          Varun Sharma added a comment -

          Hi Nicholas,

          I attached v3 which does that. Basically, I remember the last datanode selected as the primary datanode and then sort according to the last-updated timestamp, choosing a datanode which is not the same as the earlier one.

          However, there is a second twist to it. When we send the lease recovery command to the non-stale datanode, we send a list of replicas to recover the block from. If there is a single node failure, I think we probably want to avoid recovering from a stale node's replica as well; if you look at the DatanodeManager changes, I am making sure that we instruct the primary datanode to only recover from non-stale datanodes (which will be 2 nodes in the case of a single datanode failure).

          Thanks
          Varun

          Andrew Wang added a comment -

          I took a look at the latest patch, some comments:

              if (heartbeatExpireInterval > 0 && ...) {
          

          Is this additional check necessary in DatanodeManager? This shouldn't ever be <= 0, and a warning here would be a good idea if it is.

          I think only trying recovery on non-stale DNs is okay, but someone else should weigh in.

          Also needs test cases. You can take a look at the existing stale datanode test cases for some examples.

          Varun Sharma added a comment -

          Thanks Andrew for taking a look!

          I just attached v3, which should not have the additional check (that was just some leftover code). Also, I made a change to remove the config variable and select non-stale nodes unconditionally in v3.

          I will add test cases once we concur that this is along the right lines (I will take a look at the existing stale node test cases).

          Andrew Wang added a comment -

          Regarding repeated recoveries, I initially had the same thinking as Tsz Wo Nicholas Sze, but I think the v2 behavior does make sense. If DN 1 fails to recover, DN 2 tries and also fails, and DN 1 has since re-heartbeated and is at the front of the list, why not try it again? After all, the existing round-robin strategy wraps around while looping through alive nodes.

          Varun Sharma added a comment -

          Yeah, so currently we just try to avoid using the same datanode for 2 consecutive recoveries if we can - instead we try DN1, then DN2, and then we would try DN1 again. This is similar to the round-robin order DN1->DN2->DN3, just that it eliminates DN3 from the picture if it is stale.

          Tsz Wo Nicholas Sze added a comment -

          It is better to try all datanodes once in some order and then retry all of them again (in some other order). If some bad datanodes somehow fail lease recovery but are still able to heartbeat, then we may keep retrying them and starve the other good datanodes. Datanodes with a long heartbeat interval may just be busy, with nothing actually wrong.

          Varun Sharma added a comment -

          Hi Nicholas,

          I think that's a good point. Do you think we could try, say, up to two times (more than n/2 times, where n is the replica count) using the sorted method and then switch to round robin afterwards?

          Varun

          Ted Yu added a comment -

          I think the actual failure scenarios in a production cluster are hard to predict.
          What if we introduce some randomization in trying out the datanodes?

          Put another way, it might be beneficial to give users more than one policy in this regard.

          Just my two cents.

          Varun Sharma added a comment -

          I think in 99% of cases, when we see no heartbeats for 30 seconds, the datanode is dead or there is a serious issue (30 seconds is like an eternity).

          IMO, we should try our best bets first (recently heartbeating datanodes), say up to two times, and then go random/round robin; the current policy is random/round robin only. As it stands today, if a datanode dies and recoverLease is called only once, the lease recovery does not happen for 1 hour.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12579932/4721-v3.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 javac. The applied patch generated 1367 javac compiler warnings (more than the trunk's current 1366 warnings).

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover
          org.apache.hadoop.hdfs.TestDFSClientRetries

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4287//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/4287//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
          Javac warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/4287//artifact/trunk/patchprocess/diffJavacWarnings.txt
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4287//console

          This message is automatically generated.

          Varun Sharma added a comment -

          Attached patch v4...

          This one is a lot cleaner and smaller than the previous one. The logic is as follows:

          a) If this is the 3rd or greater recovery attempt, round robin around the datanodes.
          b) If this is one of the first two attempts, go around the datanodes and pick the one with the most recent heartbeat (i.e. the minimum time since the last heartbeat).
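
          In rough Java, the v4 logic described above looks something like the sketch below (placeholder types and fields; the real implementation lives in BlockInfoUnderConstruction.initializeBlockRecovery).

            // Sketch of the v4 selection policy: favour the freshest heartbeat for the
            // first two attempts, then fall back to plain round robin. Assumes at
            // least one replica.
            class V4PrimarySelection {
              private int attempts = 0;
              private int lastChosen = -1;

              // lastHeartbeat[i] is the last heartbeat time of replica i's datanode.
              int choosePrimary(long[] lastHeartbeat) {
                int n = lastHeartbeat.length;
                int chosen;
                if (attempts >= 2) {
                  // Third attempt onwards: plain round robin over the replicas.
                  chosen = (lastChosen + 1) % n;
                } else {
                  // First two attempts: most recently heartbeating node, avoiding an
                  // immediate repeat of the previously chosen one when possible.
                  chosen = -1;
                  for (int i = 0; i < n; i++) {
                    if (i == lastChosen && n > 1) {
                      continue;
                    }
                    if (chosen < 0 || lastHeartbeat[i] > lastHeartbeat[chosen]) {
                      chosen = i;
                    }
                  }
                }
                attempts++;
                lastChosen = chosen;
                return chosen;
              }
            }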

          Varun Sharma added a comment -

          Sorry, there was some issue with the previous attachment. Reattached v4 now and it looks a lot simpler.

          Thanks
          Varun

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12579965/4721-v4.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 javac. The applied patch generated 1367 javac compiler warnings (more than the trunk's current 1366 warnings).

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.datanode.TestBlockRecovery

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4289//testReport/
          Javac warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/4289//artifact/trunk/patchprocess/diffJavacWarnings.txt
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4289//console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12580008/4721-v5.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          -1 javac. The patch appears to cause the build to fail.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4293//console

          This message is automatically generated.

          stack added a comment -

          Patch looks nice and clean to me. +1

          Minor Nit: Should this be info-level? + LOG.info("Skipping stale datanode for recovery: " + expectedLocations[i]);

          TestBlockInfoUnderConstruction is a nice test. Not an issue, but if you have to redo the patch, maybe call System.currentTimeMillis() only once and reuse the value rather than calling it each time. It looks like the differences in heartbeat numbers are such that calling System.currentTimeMillis() every time should not be a problem, so no biggie.

          TestHeartbeatHandling also looks good on cursory review.

          Varun Sharma added a comment -

          Thanks Stack

          I personally prefer at least info()-level logging because debug logs are typically disabled and it's hard to trace down whether the stale node logic actually worked. Though I am not sure from your comment whether you wanted me to bump the log level up or bring it down.

          Tsz Wo Nicholas Sze
          Could you take another look at the v5 patch - it has updated tests.

          The build seems to have failed but the console shows me nothing. It builds and passes tests locally.

          https://builds.apache.org/job/PreCommit-HDFS-Build/4293//console

          Should I reattach the patch?

          Thanks
          Varun

          Nicolas Liochon added a comment -

          Would it make sense, in Datanode#recoverBlock, to set the retry to only 1?
          This would allow us to continue with the remaining blocks even if we hit a bad datanode during the datanode-side recovery...

          stack added a comment -

          Varun Sharma I was suggesting that the info logged was not worthy of INFO and should instead be DEBUG, but I am fine with your reasoning. +1 from me on the patch.

          Varun Sharma added a comment -

          Nicolas Liochon
          From what I can tell, we may really have a scenario where we retry twice within the primary DN, but I could be wrong and there could be some race conditions. I think the current behaviour is (from my look at the logs):
          1) Try to recover the block from all datanodes
          2) Hit a bad datanode and hit the socket timeout (dfs.socket.timeout), which is configurable.
          3) Simply recover the block from the good datanodes and continue

          Also, since the primary DN is up and running, it always has the data block, so there is at least one good datanode. I think optimizing recovery of > 1 block within the primary DN could be in another JIRA. This one should be focused on how to choose, at the namenode, a) the primary datanode and b) the participating datanodes.

          Varun

          Nicolas Liochon added a comment -

          Maybe there is something I haven't understood in the patch. Are you now connecting to a single datanode during the recovery (i.e. the primary one)? If not, we will have the 45 retries during step 2), and that blocks step 3).

          Varun Sharma added a comment -

          Yep, this patch does not fix that behaviour on the DataNode.

          There is one setting for everyone for the connect retries and timeout, and AFAIK that setting is not configurable on a case-by-case basis - I am talking about the Hadoop IPC client. The default is 45 retries and 20 seconds. The way to get around it is to configure your HDFS cluster to retry maybe 3 times with a timeout of 1 second, and also tune down your dfs.socket.timeout to 3 seconds. With this patch, if the bad DN has also stopped heartbeating, you will end up avoiding it during recovery at the primary DN, but if it is still heartbeating while bad, then you need to tune your timeouts.

          But as I said, this is beyond the scope of this issue and I want to keep this focused on the namenode side of things.

          Nicolas Liochon added a comment -

          You can use a specific configuration for this call, created on the fly.
          No problem if you prefer it to be in another JIRA.
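
          An on-the-fly configuration for just this call might look like the sketch below; the key names and values are illustrative and version-dependent, not settings this patch introduces.

            import org.apache.hadoop.conf.Configuration;

            class RecoveryRpcConf {
              // Build a Configuration only for the inter-datanode recovery call,
              // with tighter retry/timeout settings than the cluster-wide defaults.
              static Configuration forBlockRecovery(Configuration base) {
                Configuration conf = new Configuration(base);
                conf.setInt("ipc.client.connect.max.retries", 3);
                conf.setInt("ipc.client.connect.max.retries.on.timeouts", 3);
                conf.setInt("dfs.socket.timeout", 3000); // milliseconds
                return conf;
              }
            }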

          Varun Sharma added a comment -

          Yeah, I don't want to combine the NN and the DN changes into the same patch - this patch is more about who does the recovery and who the participants in the recovery are. Your point is more about the recovery itself - how we fail fast during block recovery.

          Btw, do you know why the build failed? I see nothing on the console - should I just reattach the patch?

          Varun

          Nicolas Liochon added a comment -

          Btw, do you know why the build failed ? I see nothing on the console - should I just reattach the patch ?

          Info is there: https://builds.apache.org/job/PreCommit-HDFS-Build/4293/artifact/trunk/patchprocess/patchJavacWarnings.txt
          Maybe the patch was applied on trunk while you wrote it for branch 2?

          Tsz Wo Nicholas Sze added a comment -

          For the v5 patch,

          • We don't need to create recoveryLocations in DatanodeManager, i.e. we don't need to change DatanodeManager at all. If a stale datanode is indeed dead, the primary won't be able to connect to it. Then, the primary will skip it.
          • The new field numRecoveryAttempts is never reset to zero. It is a problem when there are two or more independent recoveries. I suggest making the algorithm simple and removing numRecoveryAttempts.
          Varun Sharma added a comment -

          Hi Nicholas,

          Thanks for your comments. My responses here:

          1) Currently, the primary will try to connect to the dead datanode and time out via dfs.socket.timeout or ipc.client.max.retries, which introduces additional overhead when it comes to block recovery. We already introduced a patch, HDFS-3912, where we avoid placing replicas on stale nodes (stale for > 30 seconds). For people who operate HDFS for online use cases, it's good to provide an option of failing fast (like HDFS-3912) for block recovery. The other option is to tune down dfs.socket.timeout and the ipc.client.connect timeout and retries to really low numbers, but that affects cluster-wide settings.
          Let me know what you think - the idea here is that if a node has not heartbeated for the last 30 seconds, then it's no longer a good choice for block recovery if the block needs to be recovered quickly. With the change, we are able to get block recoveries within 1 second, but without it, it takes 60 seconds to recover a block.

          2) Is there another approach you had in mind? I just thought of an approach where we go to the most recent heartbeater first for two attempts and then go round robin.

          Thanks
          Varun

          Varun Sharma added a comment -

          I have pasted the log above in the comment trail, which shows the recovery happening within 1-2 seconds by bypassing the stale node. One option could be to guard the behaviour of excluding stale nodes from recovery behind a configuration flag, like the dfs.namenode.avoid.write.stale.datanode flag we introduced in HDFS-3912.

          Varun
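          For reference, a minimal sketch of the existing stale-node knobs that such a flag would sit alongside (the property names are the HDFS-3703/HDFS-3912 keys mentioned above; the class name and values here are illustrative only, and the change proposed in this issue is in how the NameNode picks the recovery primary, not in these settings):

          import org.apache.hadoop.conf.Configuration;

          public class StaleNodeSettingsSketch {
            public static void main(String[] args) {
              Configuration conf = new Configuration();
              // Treat a datanode as "stale" once it has not heartbeated for 30 seconds
              // (this is the HDFS-3703 setting; 30000 ms is also the default).
              conf.setLong("dfs.namenode.stale.datanode.interval", 30 * 1000L);
              // Avoid stale datanodes when choosing write targets (HDFS-3912).
              conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);
              System.out.println("stale interval = "
                  + conf.getLong("dfs.namenode.stale.datanode.interval", 0L) + " ms");
            }
          }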

          Varun Sharma added a comment -

          I think additional opinions from other HDFS committers would be very helpful to come to a conclusion here. The fixes seem very trivial but we need more eyes/opinions on what is the best thing to do.

          Thanks !

          Tsz Wo Nicholas Sze added a comment -

          Hi Varun,

          I see your point for (1).

          • It could cause data loss if there is only one non-stale datanode. So, let's change the condition to (recoveryLocations.size() > 1).
          • new ArrayList(..) needs generic type.
          • For the LOG.info, let's print only one line at the end instead of one line per stale datanode
          • Put the size in new DatanodeDescriptor[] instead of 0.

          For (2),

          • remove numRecoveryAttempts
          • rename primaryNodeIndex to primarySelected and consider it as a bit map: the i-th bit corresponds to the i-th datanode. For the rare case that there are >32 datanodes, choose the primary from only the first 32 datanodes.
          • Always choose the datanode with the most recent heartbeat among those datanodes whose primarySelected bit is 0. Then set the primarySelected bit to 1 for the currently chosen datanode. Once all bits are set, reset all to 0.
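          A self-contained sketch of the selection behaviour being converged on in the two points above (the Replica type and the constants are stand-ins, not the real BlockInfoUnderConstruction/DatanodeDescriptor classes; it only illustrates "prefer non-stale replicas, then pick the most recent heartbeater that has not been tried yet"):

          import java.util.ArrayList;
          import java.util.List;

          public class RecoveryPrimarySketch {

            static class Replica {
              final String datanode;
              final long lastHeartbeatMillis;
              boolean chosenAsPrimary;   // cleared again once every replica has had a turn

              Replica(String datanode, long lastHeartbeatMillis) {
                this.datanode = datanode;
                this.lastHeartbeatMillis = lastHeartbeatMillis;
              }
            }

            static final long STALE_INTERVAL_MS = 30000L;

            static Replica choosePrimary(List<Replica> replicas, long now) {
              if (replicas.isEmpty()) {
                return null;
              }
              // 1. Prefer replicas whose datanodes are not stale, but only restrict to
              //    that subset when it holds more than one replica (the data-loss
              //    concern above); otherwise fall back to the full list.
              List<Replica> recoveryLocations = new ArrayList<Replica>();
              for (Replica r : replicas) {
                if (now - r.lastHeartbeatMillis < STALE_INTERVAL_MS) {
                  recoveryLocations.add(r);
                }
              }
              List<Replica> candidates =
                  recoveryLocations.size() > 1 ? recoveryLocations : replicas;

              // 2. Among the candidates not tried yet, pick the most recent heartbeater;
              //    once every candidate has been tried, reset the flags and go again.
              Replica primary = null;
              for (Replica r : candidates) {
                if (!r.chosenAsPrimary
                    && (primary == null || r.lastHeartbeatMillis > primary.lastHeartbeatMillis)) {
                  primary = r;
                }
              }
              if (primary == null) {
                for (Replica r : candidates) {
                  r.chosenAsPrimary = false;
                }
                return choosePrimary(replicas, now);
              }
              primary.chosenAsPrimary = true;
              return primary;
            }

            public static void main(String[] args) {
              long now = System.currentTimeMillis();
              List<Replica> replicas = new ArrayList<Replica>();
              replicas.add(new Replica("dn1", now - 120000L));  // dead, no recent heartbeat
              replicas.add(new Replica("dn2", now - 2000L));
              replicas.add(new Replica("dn3", now - 5000L));
              System.out.println(choosePrimary(replicas, now).datanode);  // prints dn2
            }
          }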
          Varun Sharma added a comment -

          Thanks !

          Tsz Wo Nicholas Sze

          I agree with all your comments. For readability purposes, I was thinking of using a simple boolean array instead of an integer so I wanted to know your preference on that.

          Also, does the initializeBlockRecovery function need to be synchronized - or maybe all these calls are already synchronized?

          Thanks
          Varun

          Tsz Wo Nicholas Sze added a comment -

          Sure, a boolean array is okay since the number of recovery blocks should be small. Also, it needs to be updated whenever the replicas list is changed.

          initializeBlockRecovery is already synchronized by the namesystem lock.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12580008/4721-v5.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          -1 javac. The patch appears to cause the build to fail.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4299//console

          This message is automatically generated.

          Varun Sharma added a comment -

          Tsz Wo Nicholas Sze
          I just attached v6 with the changes you suggested. It seems to have come out cleaner than before. I rolled the boolean variable into ReplicaUnderConstruction. Could you take another look ?

          Thanks
          Varun

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12580190/4721-v6.patch
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4300//console

          This message is automatically generated.

          Tsz Wo Nicholas Sze added a comment -
          • new ArrayList(..) needs generic type, i.e. new ArrayList<DatanodeDescriptor>()
          • Put the size in new DatanodeDescriptor[], i.e. recoveryLocations.toArray(new DatanodeDescriptor[recoveryLocations.size()]),
          • chosenForRecovery may be misleading since the datanode actually is chosen as the primary datanode. We choose the non-stale datanodes for recovery. How about renaming it to chosenAsPrimary?
          • Need to update the comment "// If all nodes are stale, try recovering from all datanodes" since it may not be "all".
          Varun Sharma added a comment -

          Fixed in v7 Tsz Wo Nicholas Sze

          Varun Sharma added a comment -

          Fixed some naming in v8 (setChosenForRecovery to setChosenAsPrimary, etc.). Thanks!

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12580201/4721-v8.patch
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4301//console

          This message is automatically generated.

          Varun Sharma added a comment -

          Is this failing because it's patching against trunk instead of branch 2?

          Thanks
          Varun

          Tsz Wo Nicholas Sze added a comment -

          Varun, you are right. Please post a patch for trunk first so the Jenkins can test it.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12580215/4721-trunk.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          -1 javac. The patch appears to cause the build to fail.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4303//console

          This message is automatically generated.

          Varun Sharma added a comment -

          Tsz Wo Nicholas Sze
          I had a question about lease recovery in general. What I am finding is that lease recovery and block recovery are actually two different things: lease recovery can occur and the lease be reassigned to someone else while the block has not yet been recovered. That is, it's possible for the lease to be recovered without the block being recovered and the file being closed. Is that correct? Basically, I call "recoverLease" - the lease is recovered and block recovery is enqueued for the bad replica (without this patch). I call recoverLease again and it says lease recovery was successful even though the block was never recovered. Is this known behaviour?

          Thanks
          Varun

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12580300/4721-trunk-v2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4306//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4306//console

          This message is automatically generated.

          Tsz Wo Nicholas Sze added a comment -

          > ... lease recovery can occur and the lease be reassigned to someone else while the block has not been recovered which is, its totally possible for lease to be recovered without the block being recovered ...

          If the block is still being recovered, the new lease holder should not be able to write to the file.

          > Basically, I see that call "recoverLease" - lease is recovered and block recovery enqueued for bad replica (without this patch). Call recoverLease again and it says lease recovery was successful even though the block was never recovered. ...

          Do you mean the block is corrupted?

          recoverLease returns true only if the file is closed. If not, it is a bug.
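          The polling pattern this implies on the client side (roughly what HBase does when it recovers a WAL) is sketched below; the path and the sleep interval are illustrative, and recoverLease(Path) is the public DistributedFileSystem call that returns true only once the file has been closed:

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.hdfs.DistributedFileSystem;

          public class RecoverLeaseLoop {
            public static void main(String[] args) throws Exception {
              Path wal = new Path(args[0]);
              Configuration conf = new Configuration();
              DistributedFileSystem dfs = (DistributedFileSystem) wal.getFileSystem(conf);

              // recoverLease() returns false while recovery is still in progress and
              // true only once the NameNode has closed the file, so callers poll.
              while (!dfs.recoverLease(wal)) {
                Thread.sleep(1000);   // back off between attempts
              }
              System.out.println("Lease recovered and file closed: " + wal);
            }
          }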

          Tsz Wo Nicholas Sze added a comment -

          Varun, it seems that the patch causes TestPipelinesFailover to fail. Could you check it?

          Varun Sharma added a comment -

          Sure, I am checking on the unit tests.

          Okay, I was seeing the following behaviour - hadoop 2.0.0 alpha (without this patch)...

          a) Client calls recoverLease
          b) Namenode enqueues block recovery for primary DN which is dead, hence block recovery should never happen
          c) Client is returned a value of false
          d) Client calls recoverLease again after 1 second
          e) Client is returned a value of true even though the block recovery did not happen

          Is this a bug ?

          Varun

          Tsz Wo Nicholas Sze added a comment -

          At (e), is the file closed? Although it is unlikely, could it be another lease recovery happening in the background? Or could it be that the primary DN is actually not dead? You may check the NN log to find out why the file could be closed.

          Varun Sharma added a comment -

          I don't think so - if I grep for the file name, all I have is the following:

          2013-04-24 05:40:30,282 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /hbase/.logs/ip-10-170-15-97.ec2.internal,60020,1366780717760/ip-10-170-15-97.ec2.internal%2C60020%2C1366780717760.1366782030238. BP-889095791-10.171.1.40-1366491606582 blk_-2482251885029951704_11942

          {blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.170.15.97:50010|RBW], ReplicaUnderConstruction[10.168.12.138:50010|RBW], ReplicaUnderConstruction[10.170.6.131:50010|RBW]]}

          2013-04-24 05:40:31,655 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* fsync: /hbase/.logs/ip-10-170-15-97.ec2.internal,60020,1366780717760/ip-10-170-15-97.ec2.internal%2C60020%2C1366780717760.1366782030238 for DFSClient_NONMAPREDUCE_-1195338611_41
          2013-04-24 06:14:43,623 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: recoverLease: [Lease. Holder: DFSClient_NONMAPREDUCE_-1195338611_41, pendingcreates: 1], src=/hbase/.logs/ip-10-170-15-97.ec2.internal,60020,1366780717760-splitting/ip-10-170-15-97.ec2.internal%2C60020%2C1366780717760.1366782030238 from client DFSClient_NONMAPREDUCE_-1195338611_41
          2013-04-24 06:14:43,623 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease. Holder: DFSClient_NONMAPREDUCE_-1195338611_41, pendingcreates: 1], src=/hbase/.logs/ip-10-170-15-97.ec2.internal,60020,1366780717760-splitting/ip-10-170-15-97.ec2.internal%2C60020%2C1366780717760.1366782030238
          2013-04-24 06:14:43,623 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: File /hbase/.logs/ip-10-170-15-97.ec2.internal,60020,1366780717760-splitting/ip-10-170-15-97.ec2.internal%2C60020%2C1366780717760.1366782030238 has not been closed. Lease recovery is in progress. RecoveryId = 12012 for block blk_-2482251885029951704_11942

          {blockUCState=UNDER_RECOVERY, primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[10.170.15.97:50010|RBW], ReplicaUnderConstruction[10.168.12.138:50010|RBW], ReplicaUnderConstruction[10.170.6.131:50010|RBW]]}

          So only one lease recovery call. One second after this, recoverLease returns true - not sure why though...

          Varun Sharma added a comment -

          Time t - recoverLease called, returns false - block recovery enqueued onto a good node (because of this patch)
          Time (t + 5) - recoverLease called again, returns true
          Time (t + 7) - block recovery done, commitBlockSynchronization runs and the block is finalized

          Does this timeline look right to you ?

          Varun

          Tsz Wo Nicholas Sze added a comment -

          Could you also grep "blk_-2482251885029951704"?

          From FSNamesystem.recoverLease(..), it is clear that it will return true only if the file is not under construction, i.e. the file is closed.

          Tsz Wo Nicholas Sze added a comment -

          BTW, what is the timestamp that the file is closed?

          Varun Sharma added a comment -

          Here are the remaining messages - looking at them, there are messages 40 minutes later when I bring back the dead datanode. I think it reports the block and a recovery is then performed, since the block is still in the recovery queue.

          2013-04-24 06:57:14,373 INFO BlockStateChange: BLOCK* processReport: blk_-2482251885029951704_11942 on 10.168.12.138:50010 size 7039284 does not belong to any file
          2013-04-24 06:57:14,373 INFO BlockStateChange: BLOCK* InvalidateBlocks: add blk_-2482251885029951704_11942 to 10.168.12.138:50010
          2013-04-24 06:57:17,240 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* InvalidateBlocks: ask 10.168.12.138:50010 to delete [blk_-121400693146753449_11986, blk_7815495529310756756_10715, blk_4125941153395778345_10713, blk_7979989947202390292_11938, blk_-2482251885029951704_11942, blk_-2834772731171489244_10711]
          2013-04-24 09:14:25,284 INFO BlockStateChange: BLOCK* processReport: blk_-2482251885029951704_11942 on 10.170.6.131:50010 size 7039284 does not belong to any file
          2013-04-24 09:14:25,284 INFO BlockStateChange: BLOCK* InvalidateBlocks: add blk_-2482251885029951704_11942 to 10.170.6.131:50010
          2013-04-24 09:14:26,916 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* InvalidateBlocks: ask 10.170.6.131:50010 to delete [blk_-6242914570577158362_12305, blk_7396709163981662539_11419, blk_-121400693146753449_11986, blk_7815495529310756756_10716, blk_8175754220082115190_12303, blk_1204694577977643985_12307, blk_4125941153395778345_10718, blk_7979989947202390292_11938, blk_-2482251885029951704_11942, blk_-3317357101836432862_12390, blk_-5206526708499881023_11940, blk_-2834772731171489244_10717]
          2013-04-24 16:38:26,254 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(lastblock=BP-889095791-10.171.1.40-1366491606582:blk_-2482251885029951704_11942, newgenerationstamp=12012, newlength=7044280, newtargets=[10.170.15.97:50010], closeFile=true, deleteBlock=false)
          2013-04-24 16:38:26,255 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hdfs (auth:SIMPLE) cause:java.io.IOException: Block (=BP-889095791-10.171.1.40-1366491606582:blk_-2482251885029951704_11942) not found
          2013-04-24 16:38:26,255 INFO org.apache.hadoop.ipc.Server: IPC Server handler 55 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization from 10.170.15.97:44875: error: java.io.IOException: Block (=BP-889095791-10.171.1.40-1366491606582:blk_-2482251885029951704_11942) not found
          java.io.IOException: Block (=BP-889095791-10.171.1.40-1366491606582:blk_-2482251885029951704_11942) not found
          2013-04-24 16:38:26,255 INFO BlockStateChange: BLOCK* addBlock: block blk_-2482251885029951704_12012 on 10.170.15.97:50010 size 7044280 does not belong to any file
          2013-04-24 16:38:26,255 INFO BlockStateChange: BLOCK* InvalidateBlocks: add blk_-2482251885029951704_12012 to 10.170.15.97:50010
          2013-04-24 16:38:28,766 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* InvalidateBlocks: ask 10.170.15.97:50010 to delete [blk_-121400693146753449_12233, blk_-2482251885029951704_12012, blk_7979989947202390292_11989]

          Varun Sharma added a comment -

          One more thing, currently we do it this way:

          1) Client is writing to file F1 (a datanode lives on the client machine)
          2) Client+datanode on the machine crash
          3) F1 is renamed to F2 - I don't think we try to recover lease here
          4) After the rename is done, we call recover lease on the file

          And then the sequence of events above happens - do you think the rename is the culprit here...

          Varun Sharma added a comment -

          Sorry, I meant the directory containing the "File" is renamed,

          so

          /dir1/f1 to /dir2/f1

          Not sure how I can find the close timestamp - is it in the NN logs? The client calling recoverLease only uses the file for reads...
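          A sketch of that ordering, under the same assumptions as above (directory names are the illustrative /dir1 and /dir2; rename first, recover the lease afterwards, mirroring the sequence described in the previous comment):

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileStatus;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.hdfs.DistributedFileSystem;

          public class RenameThenRecoverSketch {
            public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              DistributedFileSystem dfs =
                  (DistributedFileSystem) new Path("/").getFileSystem(conf);

              Path logDir = new Path("/dir1");        // still holds the open file f1
              Path splittingDir = new Path("/dir2");

              // 1. Rename the directory out from under the failed writer first
              //    (HBase moves the log directory to a "-splitting" name).
              dfs.rename(logDir, splittingDir);

              // 2. Only after the rename, recover the lease on each file under the
              //    new path, polling until the NameNode reports the file closed.
              for (FileStatus stat : dfs.listStatus(splittingDir)) {
                while (!dfs.recoverLease(stat.getPath())) {
                  Thread.sleep(1000);
                }
              }
            }
          }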

          Tsz Wo Nicholas Sze added a comment -

          Is there anything between 2013-04-24 05:40:30,282 and 2013-04-24 06:57:14,373 for "blk_-2482251885029951704"? I want to see how the file got closed. Or, if you don't mind and the file is not too large, could you post the entire log?

          If you could reproduce this, please add a message to log the time that recoverLease returns true.

          Varun Sharma added a comment -

          So there are three messages in the 1st log snippet I sent

          one at 5:40 when the block is created...
          Next 2 at 6:14
          And then at 6:57 (second snippet)

          Okay, I am looking at this more closely now. It seems that at 16:38 I brought back the DataNode - then it actually got the recovery request. I can match up the recoveryId = 12012. Also, you can see the "closeFile=true". So it's like: choose the primary DN for recovery, return false, and then return true to the client after 2-3 seconds. The primary DN comes back after 6-7 hours - only then does block recovery happen, with the commitBlockSynchronization as below:

          blk_-3317357101836432862_12390, blk_-5206526708499881023_11940, blk_-2834772731171489244_10717]
          2013-04-24 16:38:26,254 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(lastblock=BP-889095791-10.171.1.40-1366491606582:blk_-2482251885029951704_11942, newgenerationstamp=12012, newlength=7044280, newtargets=[10.170.15.97:50010], closeFile=true, deleteBlock=false)

          Varun Sharma added a comment -

          I have already pasted every single line for the block - there is no more than this...

          Varun Sharma added a comment -

          Tsz Wo Nicholas Sze

          Sorry, I traced this down to a bug in the client. HDFS lease recovery seems to be perfect...

          Varun

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12580494/4721-trunk-v3.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4316//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4316//console

          This message is automatically generated.

          Varun Sharma added a comment -

          Tsz Wo Nicholas Sze

          The tests are passing with the latest patch. Should we modify the description of the stale node interval setting and suggest that we use it for block recovery as well?

          Tsz Wo Nicholas Sze added a comment -

          That's great. Please feel free to modify the description.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12580722/4721-trunk-v4.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4325//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4325//console

          This message is automatically generated.

          Tsz Wo Nicholas Sze added a comment -

          +1 patch looks good.

          Tsz Wo Nicholas Sze added a comment -

          I have committed this. Thanks, Varun!

          Hudson added a comment -

          Integrated in Hadoop-trunk-Commit #3673 (See https://builds.apache.org/job/Hadoop-trunk-Commit/3673/)
          HDFS-4721. Speed up lease recovery by avoiding stale datanodes and choosing the datanode with the most recent heartbeat as the primary. Contributed by Varun Sharma (Revision 1476399)

          Result = SUCCESS
          szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1476399
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfoUnderConstruction.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockInfoUnderConstruction.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestHeartbeatHandling.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestPipelinesFailover.java
          Hudson added a comment -

          Integrated in Hadoop-Yarn-trunk #196 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/196/)
          HDFS-4721. Speed up lease recovery by avoiding stale datanodes and choosing the datanode with the most recent heartbeat as the primary. Contributed by Varun Sharma (Revision 1476399)

          Result = SUCCESS
          szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1476399
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfoUnderConstruction.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockInfoUnderConstruction.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestHeartbeatHandling.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestPipelinesFailover.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #1385 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1385/)
          HDFS-4721. Speed up lease recovery by avoiding stale datanodes and choosing the datanode with the most recent heartbeat as the primary. Contributed by Varun Sharma (Revision 1476399)

          Result = FAILURE
          szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1476399
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfoUnderConstruction.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockInfoUnderConstruction.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestHeartbeatHandling.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestPipelinesFailover.java
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #1412 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1412/)
          HDFS-4721. Speed up lease recovery by avoiding stale datanodes and choosing the datanode with the most recent heartbeat as the primary. Contributed by Varun Sharma (Revision 1476399)

          Result = SUCCESS
          szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1476399
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfoUnderConstruction.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockInfoUnderConstruction.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestHeartbeatHandling.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestPipelinesFailover.java
          Ted Yu added a comment -

          Should this feature be backported to branch-1 ?

          Varun Sharma added a comment -

          I think so...


            People

            • Assignee:
              Varun Sharma
              Reporter:
              Varun Sharma
            • Votes:
              0
              Watchers:
              15
