Hadoop HDFS / HDFS-1172

Blocks in newly completed files are considered under-replicated too quickly

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.21.0
    • Fix Version/s: None
    • Component/s: namenode
    • Labels: None

      Description

      I've seen this for a long time, and imagine it's a known issue, but couldn't find an existing JIRA. It often happens that we see the NN schedule replication on the last block of files very quickly after they're completed, before the other DNs in the pipeline have a chance to report the new block. This results in a lot of extra replication work on the cluster, as we replicate the block and then end up with multiple excess replicas which are very quickly deleted.

      Attachments

      1. hdfs-1172.txt
        21 kB
        Eli Collins
      2. hdfs-1172.txt
        20 kB
        Todd Lipcon
      3. replicateBlocksFUC1.patch
        8 kB
        Hairong Kuang
      4. replicateBlocksFUC1.patch
        8 kB
        Hairong Kuang
      5. replicateBlocksFUC.patch
        4 kB
        Hairong Kuang
      6. HDFS-1172.patch
        3 kB
        Boris Shkolnik

        Activity

        Fengdong Yu added a comment -

        > This litters the task logs with the NotReplicatedYetException

        This looks like the client requesting a new block before the previous block's pipeline has finished.

        Ravi Prakash added a comment -

        I am able to consistently reproduce this issue with the following command on an 80 node cluster:
        hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar SliveTest -baseDir /user/someUser/slive -duration 120 -dirSize 122500 -files 122500 -maps 560 -reduces 1 -seed 1 -ops 100 -readSize 1048576,1048576 -writeSize 1048576,1048576 -appendSize 1048576,1048576 -replication 1,1 -blockSize 1024,1024 -delete 0,uniform -create 100,uniform -mkdir 0,uniform -rename 0,uniform -append 0,uniform -ls 0,uniform -read 0,uniform

        This litters the task logs with the NotReplicatedYetException
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1268)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:469)

        Matt Foley added a comment -

        Changed Target Version to 1.3.0 upon release of 1.2.0. Please change to 1.2.1 if you intend to submit a fix for branch-1.2.

        Todd Lipcon added a comment -

        Hey folks. Sorry I let this one drop off my radar for a couple of years. I don't think I'll have time to work on it in the coming months, so if you want to take it over, go ahead. I think the remaining issue was that the test coverage is still a little weak (and it will probably need significant rebasing for 2.x/3.x).

        Uma Maheswara Rao G added a comment -

        Hi Todd, once we convert the file to under construction, we are re-creating the BlockInfoUnderConstruction object if it is already complete, right?

        public BlockInfoUnderConstruction convertToBlockUnderConstruction(
            BlockUCState s, DatanodeDescriptor[] targets) {
          if (isComplete()) {
            // A *new* BlockInfoUnderConstruction is allocated here, so any
            // existing reference to the completed block becomes stale.
            return new BlockInfoUnderConstruction(
                this, getBlockCollection().getBlockReplication(), s, targets);
          }
          ...

        So the '==' comparison may cause an issue here? After this conversion, even though the block is in the under-construction state, the check may return false, since the block reference in the neededReplications list might differ from the lastBlock of the INodeFileUnderConstruction.

        Amareshwari Sriramadasu added a comment -

        @Todd, is there any update on this?
        We are hitting a similar issue in our cluster, and the number of excess blocks reaches 100,000 (1 lakh) in a day. I raised HDFS-4562 for the same issue, which would be a duplicate of this.

        Eli Collins added a comment -

        Updated patch rebased on trunk.

        Todd Lipcon added a comment -

        I'm worried that there are some other bugs lurking here, i.e. the fact that our test coverage doesn't catch this means that our understanding of the state of the world is somehow broken. So I'm hesitant to commit a change here until we really understand what's going on. If some other folks who know this area of the code well can take a look, I'd be more inclined to commit for 0.23.

        Ravi Prakash added a comment -

        Thanks Todd! It's not causing any big issues. It's just something our operations folks were expecting in the 0.23 release. And given that the first RC just got branched, I was hoping this would get in there. For now, do you think it would be possible to commit the patch without the unit test and come back for the unit test later?

        Todd Lipcon added a comment -

        I went back and looked at my branch where I was working on this patch. The remaining work is to add a test which catches the issue you pointed out with == vs .equals. Since the tests were passing even with that glaring mistake, the coverage definitely wasn't good enough. I started to write one and I think I ran into some more issues, but I can't recall what they were. Since this issue has been around forever, I haven't been able to prioritize it above other 0.23 work. Is this causing big issues on your clusters that would suggest it should be prioritized higher?

        Ravi Prakash added a comment -

        Hi Todd! Are you going to be able to finish this patch? Is there anything more to be done than changing the == to .equals() and maybe addressing my other nitpicks?

        Ravi Prakash added a comment -

        Hi Todd! Sorry for bothering you again! Any progress?

        Todd Lipcon added a comment -

        Hi Ravi. I did spend some time on this last week but I ended up stuck in a rabbit hole of some sort (now I can't remember what it was). I will revive that branch and see if I can get a new patch up this week. Thanks for the reminder.

        Ravi Prakash added a comment -

        Hi Todd. Did you have a chance to update the patch?

        Todd Lipcon added a comment -

        Hi Ravi. I think you're right, good catch. I spent some time yesterday working on writing a test that shows this bug, since the existing ones clearly don't do enough coverage. I'll upload something new soon.

        Ravi Prakash added a comment -

        I looked more closely. I think the return lastBlock == block; ought to be return lastBlock.equals(block); IMO this would be a bug. So I'm taking back my previous +1.

        Todd can you please make the change / correct me if I'm wrong?
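
        For illustration, a minimal, hypothetical demo of the hazard (not the actual NameNode code): two Block objects can be equals() while still being distinct references, so an identity check can miss a logically identical block, e.g. after convertToBlockUnderConstruction() allocates a new object.

        import org.apache.hadoop.hdfs.protocol.Block;

        public class BlockIdentityDemo {
          public static void main(String[] args) {
            // Two logically identical blocks created independently
            // (hypothetical id/length/genstamp values).
            Block a = new Block(1000L, 1024L, 2001L);
            Block b = new Block(1000L, 1024L, 2001L);

            System.out.println(a == b);      // false: different object references
            System.out.println(a.equals(b)); // true: same block identity
          }
        }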

        Ravi Prakash added a comment -

        Thanks Todd for taking care of this!

        +1 to the patch

        • Nitpicking: should we just cache a boolean for the isLastBlockOfUnderConstructionFile calls on lines 1131 and 1063?
        • Is line 1225
          return lastBlock == block;

          the same as an equality check?

        • In TestReplication.java:testReplicationWhileUnderConstruction(), after marking one block as bad (line 588), is there a quick check we can do to verify that indeed a block was added to the pending queue?
        Todd Lipcon added a comment -

        Here's a new patch against trunk for this issue.

        A few things changed since Hairong's original patch:

        • I removed the part of the test that changes the replication factor of a file while it's under construction. This part of the test wasn't succeeding reliably, since it was running into a different bug: HDFS-2283
        • Added the test code from HDFS-1197, which allows the DNs to artificially delay blockReceived calls in the tests. This exposed some other bugs with the patch.
        • The new replicateLastBlock code needed to be called in a different place:
          • the original patch called this on every attempt of completeFile(), rather than on only the final/successful attempt. This meant that, if the replicas were very slow to check in, the targets would be added to pendingReplication many times, yielding a pending replica count much larger than the actual replication factor
          • the code needs to be called for all blocks, not just the last block in a file

        I looped the new tests for a while and they pass reliably.

        Todd Lipcon added a comment -

        Reassigning patch to try to get this fixed for 0.23.

        Todd Lipcon added a comment -

        This patch looks good. Only question: does the new unit test properly fail if you remove the fix in BlockManager? It seems we should be doing something to artificially delay the block report of the DataNodes. In HDFS-1197 there is some test code that allows one to specify a delay in the DN configuration to simulate this kind of condition.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12476527/replicateBlocksFUC1.patch
        against trunk revision 1094748.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these core unit tests:
        org.apache.hadoop.hdfs.TestFileConcurrentReader

        -1 contrib tests. The patch failed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/386//testReport/
        Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/386//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/386//console

        This message is automatically generated.

        Hairong Kuang added a comment -

        Resubmitting this to trigger Hudson.

        Hairong Kuang added a comment -

        Thanks, Matt, for pointing out the additional memory benefit that this fix could provide. This patch could benefit datanode decommissioning too.

        Hairong Kuang added a comment -

        This patch:
        1. makes sure that blocks in a newly closed file do not get over-replicated;
        2. makes sure that blocks in a file under construction, except for the last block, get replicated when under-replicated; this allows a decommissioning datanode to finish decommissioning even if it has replicas in files under construction;
        3. adds a unit test.

        Matt Foley added a comment -

        Fixing this issue will not only remove a performance issue, it will also help with memory management. Every over-replicated block gets its "triplets" array re-allocated. This is a set of (3 x replication) object references used to link the block into each datanode's blockList. If the replica count becomes greater than the replication factor, this array gets re-allocated, and it never gets shrunk if the replica count decreases. If this is happening with essentially every new block, then there's an awful lot of excess memory being wasted on unused triplets. In a 200M block namenode, one excess triplet per block is 4.8GB!
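
        For concreteness, the arithmetic behind that 4.8GB figure, as a sketch assuming 8-byte object references (a 64-bit JVM without compressed oops):

        public class TripletWaste {
          public static void main(String[] args) {
            long blocks = 200_000_000L; // a 200M-block namenode
            int refsPerTriplet = 3;     // one triplet = 3 object references
            int bytesPerRef = 8;        // assumed 64-bit reference size
            long wastedBytes = blocks * refsPerTriplet * bytesPerRef;
            System.out.println(wastedBytes / 1e9 + " GB"); // prints 4.8 GB
          }
        }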

        Hairong Kuang added a comment -

        An initial patch for review. Will add a unit test and do more testing.

        Konstantin Shvachko added a comment -

        I think this makes a lot of sense. Putting r2 into pending replication is correct, as the NN knows the replication (via the pipeline) is in progress. This is exactly what is needed.

        Hairong Kuang added a comment -

        A block under construction keeps track of the pipeline. So the NN knows the block's pipeline length, which is represented by r1+r2 in the above algorithm.

        Hairong Kuang added a comment -

        I worked on a similar solution for our internal branch. Let me explain what I did. Assume that a block's replication factor is r. When a block under construction is changed to complete, if it has r1 finalized replicas and r2 unfinalized replicas, the NN puts the r2 replicas into the pending queue. If r1+r2 < r, the NN also puts the block into the neededReplication queue. Does this algorithm make sense?
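
        A self-contained sketch of that bookkeeping (hypothetical names, not the actual FSNamesystem code): the r2 unfinalized replicas are treated as pending, and extra replication is scheduled only if r1 + r2 < r.

        /** Hypothetical model of the decision made when a block under
         *  construction becomes complete. */
        public class CompletionDecision {
          final int pendingCount;          // r2: replicas the NN still expects
          final boolean needsReplication;  // true only if r1 + r2 < r

          CompletionDecision(int r, int r1, int r2) {
            this.pendingCount = r2;              // goes to the pending queue
            this.needsReplication = r1 + r2 < r; // goes to neededReplication
          }

          public static void main(String[] args) {
            // r = 3: one finalized replica, two pipeline replicas outstanding.
            CompletionDecision d = new CompletionDecision(3, 1, 2);
            System.out.println(d.pendingCount);     // 2
            System.out.println(d.needsReplication); // false: no extra work
          }
        }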

        Konstantin Shvachko added a comment -

        I think this should be controlled by "dfs.namenode.replication.interval". It is currently set to 3 seconds. If DNs do not keep up with reporting blocks, it should be increased.
        Putting blocks into pendingReplication feels like a trick, although it does slow down replication of the last block.
        I think the right solution would be to add logic to the processing of a failed pipeline. When this happens the client asks for a new generation stamp. At this point the NN can make a note that this block will not have enough replicas. This would distinguish between blocks that have not been reported yet and those that will never be reported. This is much more work.
        In practice I think tuning up the "replication.interval" parameter should be sufficient.
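
        As a hedged example, here is how that interval could be raised through the Configuration API (the value 10 is illustrative, not a recommendation):

        import org.apache.hadoop.conf.Configuration;

        public class ReplicationIntervalTuning {
          public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Default is 3 seconds; a larger value gives slow DNs more time
            // to report new blocks before replication work is scheduled.
            conf.setInt("dfs.namenode.replication.interval", 10);
            System.out.println(conf.get("dfs.namenode.replication.interval"));
          }
        }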

        Boris Shkolnik added a comment -

        I agree that 5 minutes is too long, but putting it into pendingReplication still seems more appropriate. Maybe we can modify the pendingReplication monitor to adjust its check interval dynamically to the next timing-out replication. This would, of course, require having a timeout value per replication (or we could reuse the timestamp for that).

        dhruba borthakur added a comment -

        Putting it in pendingReplication means that replication (when needed) will occur only after 5 minutes. This is a long time, isn't it? Maybe it is better to put it in neededReplication but (somehow) ensure that replication is not attempted until after a small delay.

        Boris Shkolnik added a comment -

        Does this patch look like what has been discussed here?
        It puts under-replicated blocks into the pending replication queue in the case of a newly created file.

        Todd Lipcon added a comment -

        I think reusing PendingReplicationBlocks is probably the best idea so far - we already have confidence in that code, and it should only require a very small patch.

        Hairong Kuang added a comment -

        > Primary DN send the blockReceived on account of all DNs.

        This will cause a race condition: the primary DN reports that block B is received at DN1, but after that the NN receives a block report from DN1 saying that it does not have B.

        One option is that checkReplicationFactor(newFile) puts the block in the PendingReplicationBlocks queue instead of the neededReplication queue, since the NN knows exactly from whom it is expecting blockReceived.

        Scott Carey added a comment -

        > Scott's solution of making the primary DN send the blockReceived on account of all DNs would work, but sounds complicated, especially in the failure cases (e.g. what if the primary DN fails just before sending the RPC? Do we lose all the replicas? No good!)

        Yeah, it's complicated. To simplify failure scenarios, leave the rest similar to the current state: the next regularly scheduled ping from a DN will provide the new block information, but the primary DN will still do its best to send all the block data it can gather so that the initial registration is as complete as possible. Perhaps the NN treats this extra information as provisional until it gets a ping from the other DNs to confirm.

        Functionally, this won't differ much from Dhruba's proposition, and is more complicated.

        Todd Lipcon added a comment -

        Ah, very clever, Dhruba! I like that idea.

        dhruba borthakur added a comment -

        > UnderReplicatedBlocks could be augmented to carry a dontProcessUntil timestamp.

        To expand on this idea, we can delay replication of a block until a few seconds (configurable) after the modification time of the file. That could avoid storing an additional timestamp in UnderReplicatedBlocks.
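
        A minimal sketch of that guard (hypothetical names; the real replication monitor would read the file's modification time from the INode):

        /** Hypothetical check: defer replication of a block whose file was
         *  modified within a configurable grace window. */
        public class ReplicationDelayGuard {
          private final long graceMillis;

          ReplicationDelayGuard(long graceMillis) { this.graceMillis = graceMillis; }

          boolean shouldDefer(long fileModificationTime, long now) {
            return now - fileModificationTime < graceMillis;
          }

          public static void main(String[] args) {
            ReplicationDelayGuard guard = new ReplicationDelayGuard(10_000L); // 10s grace
            long now = System.currentTimeMillis();
            System.out.println(guard.shouldDefer(now - 2_000L, now));  // true: wait
            System.out.println(guard.shouldDefer(now - 60_000L, now)); // false: replicate
          }
        }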

        Todd Lipcon added a comment -

        I think there are a few solutions to this:

        • HDFS-611 should help a lot. We have often seen this issue after doing a large-scale decrease in replication count, or a large directory removal, since the block deletions hold up the blockReceived call in DN.offerService. But this isn't a full solution - there are still other ways in which the DN can be slower at acking a new block than the client is in calling completeFile.
        • Scott's solution of making the primary DN send the blockReceived on account of all DNs would work, but sounds complicated, especially in the failure cases (e.g. what if the primary DN fails just before sending the RPC? Do we lose all the replicas? No good!)
        • UnderReplicatedBlocks could be augmented to carry a dontProcessUntil timestamp. When we check replication in response to a completeFile, we can mark the neededReplications with a "don't process until N seconds from now", which causes them to get skipped over by the replication monitor thread until a later time. This should give the DNs a bit of leeway to report the blocks, while not changing the control flow or distributed parts at all.

        Dhruba's workaround of upping min replication indeed helps, but as he said, it comes at a great cost to the client, especially in the cases where it would help (e.g. if one DN is 10 seconds slow).

        Scott Carey added a comment -

        I run with min replication = 2, yet see this all the time.

        In fact, based on that idea I might want to try min.replication = 1 to see if they become more or less frequent!

        dhruba borthakur added a comment -

        Can this be achieved by setting min.replication to something larger than the default value of 1? This means that the close call from the client will succeed only if the namenode has received confirmation from at least 'min.replication' number of replicas. (there could be performance overheads though)
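
        For reference, a sketch of that workaround; the key name dfs.replication.min is an assumption for this era of Hadoop (later versions use dfs.namenode.replication.min), and the tradeoff is that close() then blocks until that many replicas are confirmed:

        import org.apache.hadoop.conf.Configuration;

        public class MinReplicationWorkaround {
          public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Assumed key name; close() succeeds only once the NN has
            // confirmation of at least this many replicas of each block.
            conf.setInt("dfs.replication.min", 2);
            System.out.println(conf.getInt("dfs.replication.min", 1));
          }
        }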

        Scott Carey added a comment -

        Perhaps when the write pipeline completes, it should pass back the block information so that the initial commit to the NN can atomically add all the blocks.

        Example:

        DNs in the pipeline are DN1, DN2, and DN3.

        A block is being written, the client writes to DN1, which writes to DN2, which writes to DN3. When DN3 completes, it notifies DN2 and provides its block replica information. When DN2 completes and has DN3's response, it passes its information, along with DN3's, to DN1. When DN1 completes, and has DN2's information along with DN3's, it reports to the NN the information about all 3 replicas, and lastly returns to the original client.

        This will have a few benefits:

        • Fewer RPCs to the NN, and therefore less NN load.
        • Atomic visibility of all replicas to the NN and clients.
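
        A toy, single-process model of the cascading ack described above (hypothetical types; real DNs would pass this information over the pipeline's connections):

        import java.util.ArrayList;
        import java.util.List;

        public class CascadingAckDemo {
          // Each DN waits for its downstream neighbor's acks, appends its own
          // replica info, and passes the combined list upstream.
          static List<String> ackUpstream(String[] pipeline, int idx, long blockId) {
            List<String> acks = (idx + 1 < pipeline.length)
                ? ackUpstream(pipeline, idx + 1, blockId)  // wait for downstream DN
                : new ArrayList<>();
            acks.add(pipeline[idx] + ":blk_" + blockId);
            return acks;
          }

          public static void main(String[] args) {
            String[] pipeline = {"DN1", "DN2", "DN3"};
            // DN1 ends up holding acks for all three replicas and could report
            // them to the NN in a single RPC before returning to the client.
            System.out.println(ackUpstream(pipeline, 0, 1000L));
            // prints [DN3:blk_1000, DN2:blk_1000, DN1:blk_1000]
          }
        }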

        Scott Carey added a comment -

        > This doesn't cause major issues, but we do end up wasting a fair amount of disk and network resources.

        I guess it isn't 'major', but I get this all the time using Pig; it might be the same issue:

        It looks like a file is written and closed, then re-opened before the NN knows the pipeline is done.


        org.apache.pig.backend.executionengine.ExecException: ERROR 2135: Received error from store function.org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not replicated yet:/tmp/temp1164506480/tmp1316947817/_temporary/_attempt_201005212210_0961_m_000055_0/part-m-00055
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1268)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:469)
        at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:966)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:962)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:960)

        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getNext(POStore.java:151)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.runPipeline(POSplit.java:254)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.processPlan(POSplit.java:236)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.processPlan(POSplit.java:241)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.processPlan(POSplit.java:241)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.getNext(POSplit.java:228)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:233)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:228)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)
        Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not replicated yet:/tmp/temp1164506480/tmp1316947817/_temporary/_attempt_201005212210_0961_m_000055_0/part-m-00055
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1268)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:469)
        at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)

        Todd Lipcon added a comment -

        Particular sequence of events:

        1. client finishes writing to block with 3 replicas
        2. first DN happens to heartbeat, so addStoredBlock is called in the NN
        3. client calls completeFile, which calls checkReplicationFactor(newFile) when finalizing the INode
        4. NN adds block to pending replication
        5. Replication monitor runs and schedules two replications
        6. second and third pipeline DNs send their addStoredBlock notifications with their heartbeats
        7. Replications finish, and the new replicas report the new blocks as well
        8. NN notices the excess replicas and schedules deletion

        This doesn't cause major issues, but we do end up wasting a fair amount of disk and network resources.

        The question is why sometimes the immediate heartbeat on blockReceived doesn't trigger as it's supposed to.


          People

          • Assignee: Unassigned
          • Reporter: Todd Lipcon
          • Votes: 1
          • Watchers: 27
