Hadoop HDFS / HDFS-4504

DFSOutputStream#close doesn't always release resources (such as leases)

    Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      DFSOutputStream#close can throw an IOException in some cases. One example is if there is a pipeline error and then pipeline recovery fails. Unfortunately, in this case, some of the resources used by the DFSOutputStream are leaked. One particularly important resource is file leases.

      So it's possible for a long-lived HDFS client, such as Flume, to write many blocks to a file, but then fail to close it. Unfortunately, the LeaseRenewerThread inside the client will continue to renew the lease for the "undead" file. Future attempts to close the file will just rethrow the previous exception, and no progress can be made by the client.
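      To make the failure mode concrete, here is a minimal sketch; the path, data, and retry count are illustrative, not taken from any real client:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch of the failure mode described above. If a pipeline
// error makes the first close() throw, every later close() rethrows
// the same exception, while the client's lease renewer keeps renewing
// the lease on the "undead" file.
public class StuckCloseSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream out = fs.create(new Path("/flume/events.log"));
    out.writeBytes("many blocks of data...");
    for (int i = 0; i < 3; i++) {
      try {
        out.close();
        break;  // never reached once the pipeline has failed
      } catch (IOException e) {
        // Each retry just rethrows the original failure; the lease is
        // still held, and the file can never be closed or reopened.
      }
    }
  }
}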

      Attachments

      1. HDFS-4504.016.patch (68 kB, Colin Patrick McCabe)
      2. HDFS-4504.015.patch (70 kB, Colin Patrick McCabe)
      3. HDFS-4504.014.patch (66 kB, Colin Patrick McCabe)
      4. HDFS-4504.011.patch (33 kB, Colin Patrick McCabe)
      5. HDFS-4504.010.patch (31 kB, Colin Patrick McCabe)
      6. HDFS-4504.009.patch (30 kB, Colin Patrick McCabe)
      7. HDFS-4504.008.patch (30 kB, Colin Patrick McCabe)
      8. HDFS-4504.007.patch (28 kB, Colin Patrick McCabe)
      9. HDFS-4504.002.patch (10 kB, Colin Patrick McCabe)
      10. HDFS-4504.001.patch (12 kB, Colin Patrick McCabe)

          Activity

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12599081/HDFS-4504.016.patch
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/6241//console

          This message is automatically generated.

          Fengdong Yu added a comment -

          Colin Patrick McCabe, this has been open for a long time. Please come back to it.

          Can you also add these new configuration properties to hdfs-default.xml? Thanks.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12599081/HDFS-4504.016.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 12 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The following test timeouts occurred in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.TestFileAppend3
          org.apache.hadoop.hdfs.TestHFlush

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4882//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4882//console

          This message is automatically generated.

          Uma Maheswara Rao G added a comment -

          > Thanks for thinking of this. Let me see if I can summarize the issue. If there is a streamer failure, and the DFSClient calls completeFile, the last block in the file will transition from state UNDER_CONSTRUCTION to state COMMITTED. This, in turn, will prevent later calls made by the client to recoverLease from working, since we only do block recovery on blocks in state UNDER_CONSTRUCTION or UNDER_RECOVERY. The ZombieStreamCloser will not be able to run block recovery either, for the same reason. Is that a fair summary?

          Yes.

          > Really, the question is what is the right behavior in DFSOutputStream#close after a streamer failure? Calling completeFile(force=false) seems wrong. We need to perform block recovery in this scenario, as you said. Calling completeFile(force=true) will start block recovery (it calls FSNamesystem#internalReleaseLease). That seems like the right thing to do.

          Where we are not sure that the client received the last packet ack, we should not call completeFile. Here, completeFile commits the block on the assumption that the client already got the ack for the last packet, which would mean the DN has also finalized the block and will report it shortly. So in such cases we should not go with completeFile; we should instead recover the file lease somehow, which should also initiate finalization at the DN. Please note that we would be tweaking things here to fall into the case Todd pointed out earlier. It may be better to check the holder name: only if the file's holder is the current holder should we recover the file lease with the new API.

          > It might make sense to create a new RPC with a different name than completeFile, to avoid confusion with the other function of completeFile. But fundamentally, starting block recovery is what we need to do here, and we might as well do it from DFSOutputStream#close. I think this will solve the problem.

          I think it may solve it, but IMO a simpler approach would be to just reassign the lease holder at the NN with some special name for these zombie streams. In that case, the NN will take care of recovering them correctly, and the current client's renewLease will not renew the lease for these files. We can consider this option for more simplicity and less risk, I feel.
          The ZombieStreamManager would just need to ensure it has successfully informed the NN about the zombie stream instead of calling complete; everything else can stay the same. We can think more about whether this has any other impacts.

          Currently, I think (a guess; I need to verify this) that if the same holder tries to recoverLease from the client, it may not be allowed, since the same client that holds the lease is trying to recover the file. If so, we would need to allow this under the above proposal with some indication.

          Colin Patrick McCabe added a comment -

          Thanks for thinking of this. Let me see if I can summarize the issue. If there is a streamer failure, and the DFSClient calls completeFile, the last block in the file will transition from state UNDER_CONSTRUCTION to state COMMITTED. This, in turn, will prevent later calls made by the client to recoverLease from working, since we only do block recovery on blocks in state UNDER_CONSTRUCTION or UNDER_RECOVERY. The ZombieStreamCloser will not be able to run block recovery either, for the same reason. Is that a fair summary?

          Really, the question is what is the right behavior in DFSOutputStream#close after a streamer failure? Calling completeFile(force=false) seems wrong. We need to perform block recovery in this scenario, as you said. Calling completeFile(force=true) will start block recovery (it calls FSNamesystem#internalReleaseLease). That seems like the right thing to do.

          It might make sense to create a new RPC with a different name than completeFile, to avoid confusion with the other function of completeFile. But fundamentally, starting block recovery is what we need to do here, and we might as well do it from DFSOutputStream#close. I think this will solve the problem.

          Vinayakumar B added a comment -

          > A DFSOutputStream is a zombie for one of two reasons:
          > 1. The client can't contact the NameNode (perhaps because of a network problem)
          > 2. The client asked the NameNode to complete the file and it refused, because the NN does not (yet?) have a record that all of the file's blocks are present and complete.

          You cannot ignore a DataStreamer failure as not producing a zombie. That is the most likely cause, and it can happen frequently too. As already described in the test above, trying to close the file after a DataStreamer failure could lead to a real problem; force complete also fails.

          > As I said before, the current code doesn't do anything special in the case of a data streamer failure in DFSOutputStream#close. It just throws up its hands and says "oh well, guess that data's gone!" After the hard-lease period expires, we will complete the file anyway. So it's exactly the same behavior with this patch as without it-- only the timeout is different.

          In case of a DataStreamer failure due to a pipeline failure:
          Without the patch, complete() will not be called, so the block state does not change on the NN side; hence recovery of the file will succeed and no data will be lost.
          With the patch, the complete() call marks the block state as COMMITTED, which blocks recovery/force complete until the block is reported by a DN, which will never happen. So data is lost.

          > This might be a good idea, but we should do it in a future JIRA. This patch is big enough, and changes enough things already.

          Yes, that's correct. To avoid too many changes in this patch itself, we are suggesting just trying to report the zombie to the NN instead of force completing. This covers all the cases you mentioned.

          Uma Maheswara Rao G added a comment - edited

          Looks like this is a good problem scenario pointed out by Vinay here.
          The client should not call complete without knowing that it received the ack for the last packet. Otherwise the DN may still have the block in RBW state while the NN block state becomes COMMITTED because the client called complete. Now the states are inconsistent, and internalReleaseLease will also not be allowed, because the committed block did not get the minimum replicas reported by the DN.

case COMMITTED:
  // Close file if committed blocks are minimally replicated
  if (penultimateBlockMinReplication &&
      blockManager.checkMinReplication(lastBlock)) {
    finalizeINodeFileUnderConstruction(src, pendingFile,
        iip.getLatestSnapshot(), false);
    NameNode.stateChangeLog.warn("BLOCK*"
        + " internalReleaseLease: Committed blocks are minimally replicated,"
        + " lease removed, file closed.");
    return true;  // closed!
  }
  // Cannot close file right now, since some blocks
  // are not yet minimally replicated.
  // This may potentially cause infinite loop in lease recovery
  // if there are no valid replicas on data-nodes.
  String message = "DIR* NameSystem.internalReleaseLease: " +
      "Failed to release lease for file " + src +
      ". Committed blocks are waiting to be minimally replicated." +
      " Try again later.";
  NameNode.stateChangeLog.warn(message);

          Ideally, if a block is committed, the DN must have finalized it as well, and the DN will report the finalized block state in that case. Here, the DN has the block only in RBW state because it failed in between. Due to that failure, the stream got added as a zombie, and it will try to complete the file without knowing whether it really received the last packet ack or not.

          In the normal recovery case, the block will be finalized by the normal recovery flow, as below:

case UNDER_CONSTRUCTION:
case UNDER_RECOVERY:
  final BlockInfoUnderConstruction uc = (BlockInfoUnderConstruction)lastBlock;
  // setup the last block locations from the blockManager if not known
  if (uc.getNumExpectedLocations() == 0) {
    uc.setExpectedLocations(blockManager.getNodes(lastBlock));
  }
  // start recovery of the last block for this file
  long blockRecoveryId = nextGenerationStamp(isLegacyBlock(uc));
  lease = reassignLease(lease, src, recoveryLeaseHolder, pendingFile);
  uc.initializeBlockRecovery(blockRecoveryId);
  leaseManager.renewLease(lease);
  // Cannot close file right now, since the last block requires recovery.
  // This may potentially cause infinite loop in lease recovery
  // if there are no valid replicas on data-nodes.
  NameNode.stateChangeLog.warn(
      "DIR* NameSystem.internalReleaseLease: " +
      "File " + src + " has not been closed." +
      " Lease recovery is in progress. " +
      "RecoveryId = " + blockRecoveryId + " for block " + lastBlock);
  break;
}

          This will make the DN finalize blocks if they are in RBW state. But if we have already changed the state to COMMITTED, the recovery flow will be diverted and no one will finalize the block at the DN. I am afraid these changes may cause problems like this. So I think it is better to do recovery via the NN only, by just informing the NN about the zombie files when the NN is available. Once we successfully inform the NN about a zombie file, we can remove that file's entry from the client's list; until then, keep trying to inform the NN. This may be the better choice, as it avoids risks like the scenario above.

          Vinayakumar B added a comment -

          Please check this test

@Test
public void testPipelineFailureWithZombie() throws Exception {
  Configuration conf = new Configuration();
  conf.setInt(DFSConfigKeys.DFS_BLOCK_SIZE_KEY, BLOCK_SIZE);
  conf.setInt(DFSConfigKeys.DFS_CLIENT_CLOSE_TIMEOUT_MS, 5000);
  MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).numDataNodes(1)
      .build();
  DistributedFileSystem fs = cluster.getFileSystem();
  FSDataOutputStream fos = fs.create(new Path("/test"));
  boolean closed = false;
  DataNodeProperties dn = null;
  try {
    fos.writeBytes("Hello");
    fos.hflush();
    dn = cluster.stopDataNode(0);
    fos.writeBytes("Hello again");
    fos.close();
    closed = true;
  } catch (Exception e) {
    // Ignore as of now
  } finally {
    try {
      fos.close();
      closed = true;
    } catch (IOException e) {
      // Ignore as close will not be able to complete
    }
  }
  if (!closed) {
    // just to check the activity by ZombieStreamManager
    Thread.sleep(10000);
    cluster.restartDataNode(dn, true);
    Thread.sleep(Long.MAX_VALUE);
  }
}

          In this case, since the streamer is not set to null, complete is called with lastBlock, which changes the block's state to COMMITTED and expects minimum replication.
          This will never be satisfied, and force complete and recoverLease also fail, saying the minimum replication for COMMITTED blocks is not met.

          Colin Patrick McCabe added a comment -

          This revision fixes the "attempt to close null datastreamer" problem that Vinay spotted.

          Colin Patrick McCabe added a comment -

          Vinay wrote:

          > While handling the Zombie stream, ZombieStreamManager can report to the NameNode via some new RPC that this stream is a zombie.

          A DFSOutputStream is a zombie for one of two reasons:
          1. The client can't contact the NameNode (perhaps because of a network problem)
          2. The client asked the NameNode to complete the file and it refused, because the NN does not (yet?) have a record that all of the file's blocks are present and complete.

          In scenario #1, we can't tell the NameNode anything because we can't talk to it.

          In scenario #2, the NameNode already knows everything it needs to know about the file. It doesn't care whether we consider the file a zombie or not-- why would it? All it knows is that the file isn't complete yet.

          The big picture for this change is that we're trying to prevent a scenario where the DFSOutputStream is never closeable and leaks resources forever. In order to do that, we sometimes have to make some unpleasant choices. One of them is that if there was a data streamer failure, we complete the file anyway after a configurable time period (currently 2 minutes). If you don't like this policy, you can just set the period so long that it corresponds to the lease recovery period.

          As I said before, the current code doesn't do anything special in the case of a data streamer failure in DFSOutputStream#close. It just throws up its hands and says "oh well, guess that data's gone!" After the hard-lease period expires, we will complete the file anyway. So it's exactly the same behavior with this patch as without it-- only the timeout is different.

          It sounds like what you want to do is somehow "try harder" to fix the data streamer failure when you know the file is being closed. This might be a good idea, but we should do it in a future JIRA. This patch is big enough, and changes enough things already.
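          For operators who prefer the old semantics, the policy above boils down to one knob. A minimal sketch, assuming the DFSConfigKeys constant added by this patch; the one-hour value is purely illustrative, standing in for "roughly the lease recovery period":

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DFSConfigKeys;

// Sketch: opt out of the "complete anyway after two minutes" policy by
// stretching the close timeout toward the lease hard limit.
public class LongCloseTimeout {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt(DFSConfigKeys.DFS_CLIENT_CLOSE_TIMEOUT_MS, 60 * 60 * 1000);
    FileSystem fs = FileSystem.get(conf);
    // ... create, write, and close files as usual ...
    fs.close();
  }
}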

          Vinayakumar B added a comment -

          > In a lot of cases when close fails, the NameNode is not reachable. The behavior I implemented in the patch is designed to let long-running clients handle these transient problems gracefully. Currently, uncloseable files get created, and a client restart is needed to get rid of them. For example, you might have to restart your Flume daemon, your NFS gateway, etc. The client wants to be able to tell when the file is actually closed in case it needs to reopen or move it. Currently, there is no way to do this. With this patch, it can do this by continuing to call close until it no longer throws an exception.

          Anyway, even after the file is considered a zombie, the NameNode must be reachable for the file to be closed gracefully. So Uma and I discussed the following possible scenarios, and we came to the conclusion mentioned in the comment above.

          In case of a pipeline failure, not all datanodes may have completed the block, and those will not report it to the namenode. So even a graceful closure of the file may fail in such cases. The last block must be recovered on all those datanodes before the file can be closed properly. One way to do this is to call recoverLease, but as Todd pointed out, this may lead to other problems.

          As you said, for long-living clients such as Flume, recovery of these files currently happens only on a client restart.

          So, on the client side, instead of attempting a graceful closure of the file in ZombieStreamManager or DFSOS.close(), which may fail every time, what we propose is as follows (sketched below):

          1. In ZombieStreamManager, periodically check for streams in DFSClient.filesBeingWritten which are marked as closed but still not removed from the map. These are obviously zombie streams. The check period can be something reasonable.
          2. While handling a zombie stream, ZombieStreamManager can report to the NameNode via some new RPC that the stream is a zombie. The NN can then check and reassign the lease if the client still holds the lease on that file. The target lease holder can be anything, so that after the hard-limit expiry the NN will automatically recover the file.
          3. Once the zombie stream has been reported to the NN successfully, the DFSOS can be removed from the DFSClient.filesBeingWritten map.

          This way, we can avoid more changes in the patch.
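          A rough sketch of this proposal in code, to make the flow concrete. Everything here is hypothetical: reportZombieStream is the proposed new RPC (it does not exist in HDFS today), and getZombieStreams(), getSrc(), stopTrackingZombie(), and the period are placeholders for illustration only.

import java.io.IOException;

// Hypothetical sketch of the three steps above.
class ZombieStreamManager implements Runnable {
  private final DFSClient client;

  ZombieStreamManager(DFSClient client) {
    this.client = client;
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      // Step 1: streams marked closed but still in filesBeingWritten
      // are zombies (placeholder accessor).
      for (DFSOutputStream zombie : client.getZombieStreams()) {
        try {
          // Step 2: tell the NN, so it can reassign the lease holder;
          // after the hard limit expires the NN recovers the file.
          client.namenode.reportZombieStream(zombie.getSrc(),
              client.getClientName());
          // Step 3: reported successfully, so stop renewing its lease.
          client.stopTrackingZombie(zombie);
        } catch (IOException e) {
          // NN unreachable: keep the entry and retry next period.
        }
      }
      try {
        Thread.sleep(60 * 1000);  // "reasonable" period; illustrative
      } catch (InterruptedException e) {
        return;
      }
    }
  }
}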

          > It seems like the case you are concerned about is the case where we fail to get the last block, because of a streamer failure. This is already a problem, and I don't think this patch makes it worse (although it doesn't make it better, either)

          I agree, but before your patch a close() call would never get that far; it would fail with an NPE first. So there was no issue.

          Colin Patrick McCabe added a comment -

          > In the latest patch, some unnecessary (whitespace-only) file changes from the MapReduce, Tools, and YARN projects got added. I assume these were added by mistake. Can you please remove them?

          I did that because I wanted a test run on all projects. I will remove them for the next patch, since the other projects came up fine with this.

          > I think the above piece of code can be problematic in the case of an hflush failure followed by a close call. On a sync failure, closeThreads is called and streamer becomes null there; the closed flag is also set.

          Thanks, that's a good catch. I will check that streamer is not null in closeThreads.

          > How about simply informing the NN about the zombie situation for a file...

          In a lot of cases when close fails, the NameNode is not reachable. The behavior I implemented in the patch is designed to let long-running clients handle these transient problems gracefully. Currently, uncloseable files get created, and a client restart is needed to get rid of them. For example, you might have to restart your Flume daemon, your NFS gateway, etc. The client wants to be able to tell when the file is actually closed in case it needs to reopen or move it. Currently, there is no way to do this. With this patch, it can do this by continuing to call close until it no longer throws an exception.
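          A sketch of the retry-until-closed pattern this enables; the helper class name and backoff are illustrative, not an HDFS API:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;

// With this patch, close() can be retried until it stops throwing,
// at which point the file is really closed on the NameNode.
public class RetryingCloser {
  public static void closeWithRetries(FSDataOutputStream out)
      throws InterruptedException {
    while (true) {
      try {
        out.close();
        return;  // no exception: the file is really closed
      } catch (IOException e) {
        // Transient failure (e.g. NameNode unreachable): wait and retry.
        Thread.sleep(10000);
      }
    }
  }
}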

          It seems like the case you are concerned about is the case where we fail to get the last block, because of a streamer failure. This is already a problem, and I don't think this patch makes it worse (although it doesn't make it better, either). If you have ideas for how to improve this case, maybe we should file a follow-on JIRA?

          Uma Maheswara Rao G added a comment - edited

          Hi Colin, nice work on this issue.

List<IOException> ioExceptions = new LinkedList<IOException>();
if (!closed) {
  try {
    flushBuffer();       // flush from all upper layers

    if (currentPacket != null) {
      waitAndQueueCurrentPacket();
    }

    if (bytesCurBlock != 0) {
      // send an empty packet to mark the end of the block
      currentPacket = new Packet(0, 0, bytesCurBlock,
          currentSeqno++, this.checksum.getChecksumSize());
      currentPacket.lastPacketInBlock = true;
      currentPacket.syncBlock = shouldSyncBlock;
    }

    flushInternal();             // flush all data to Datanodes
  } catch (IOException e) {
    DFSClient.LOG.error("unable to flush buffers during file close " +
        "for " + src, e);
    ioExceptions.add(e);
  } finally {
    closed = true;
  }
}
// get last block before destroying the streamer
ExtendedBlock lastBlock = streamer.getBlock();
closeThreads(false);

          I think the above piece of code can be problematic in the case of an hflush failure followed by a close call.
          On a sync failure, closeThreads is called and streamer becomes null there; the closed flag is also set.
          When the user then calls close, we unconditionally try closeThreads again and also try to get the last block from streamer.

          I think that in the pipeline failure case, if we don't get the last block (because the streamer was closed on the pipeline failure), force closing may not be a good choice, since we don't get the last block correctly from the client.

          Vinay and I were thinking about this issue. How about simply informing the NN about the zombie situation for a file and changing the client holder name to something like ZombieFile (the intention is just to make sure the client is not renewing unintended files)? That ensures renewLease will not renew such files, and closing will happen normally, as the NN does today via hard-limit expiry. Alternatively, the renewLease call could carry the list of zombie files that should be skipped when renewing for this client.

          Vinayakumar B added a comment -

          In the latest patch, some unnecessary (whitespace-only) file changes from the MapReduce, Tools, and YARN projects got added. I assume these were added by mistake. Can you please remove them?

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12598294/HDFS-4504.015.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 12 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The following test timeouts occurred in hadoop-hdfs-project/hadoop-hdfs hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-tools/hadoop-distcp hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client:

          org.apache.hadoop.hdfs.TestHFlush

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4834//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4834//console

          This message is automatically generated.

          Colin Patrick McCabe added a comment -

          A few tests needed to be fixed, since they were assuming that a single IOException would cause DFSOutputStream#close to give up immediately, whereas that is no longer the case. Setting DFSConfigKeys.DFS_CLIENT_CLOSE_TIMEOUT_MS to a low value allows the tests to complete in a reasonable amount of time.

          FSNamesystem#completeInternal(force=true) needed to call internalReleaseLease to properly finalize the last block in the file. Failure to do this could put us in a situation where the NameNode thinks the block is finalized, but the DataNode(s) do not.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12598096/HDFS-4504.014.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 10 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-tools/hadoop-distcp hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client:

          org.apache.hadoop.hdfs.server.namenode.TestSaveNamespace
          org.apache.hadoop.hdfs.server.namenode.ha.TestHAStateTransitions
          org.apache.hadoop.hdfs.TestHdfsClose
          org.apache.hadoop.hdfs.TestFileAppend4
          org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover

          The following test timeouts occurred in hadoop-hdfs-project/hadoop-hdfs hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-tools/hadoop-distcp hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client:

          org.apache.hadoop.hdfs.TestFileAppend3

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4827//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4827//console

          This message is automatically generated.

          Colin Patrick McCabe added a comment -

          The latest patch:

          • when reaping zombie files, don't use recoverLease. Instead, add a "force" flag to completeFile (sketched below).
          • add dfs.client.close.timeout.ms, to specify how long we should wait inside close() before making the file a zombie. Previously, we used ipc.ping.interval to determine how long to wait. Having a configuration option for this makes unit tests that need a "close + unresponsive namenode" setup much simpler to write.
          • FSNamesystem#completeFile should issue a different log message on failure than on success.
          • TestHFlush#testHFlushInterrupted: Thread#interrupted is a static function; refer to it statically to avoid Java warning. Clear interrupted status when appropriate.

          Since this is a bigger change, I added small whitespace changes in hadoop-mapreduce-client, hadoop-yarn, and hadoop-tools to get a full test run, so that we can become aware of any issues.
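          For reference, the first bullet amounts to a client-protocol change along these lines. This is only a sketch reconstructed from the description in this thread; the actual signature and javadoc in the patch may differ:

import java.io.IOException;

import org.apache.hadoop.hdfs.protocol.ExtendedBlock;

// Sketch of what the "force" flag implies for the client protocol.
interface ForceCompleteSketch {
  /**
   * Complete writing to a file. With force == true, the NameNode
   * finalizes the file (via internalReleaseLease) instead of
   * returning false forever when it has not yet accounted for all
   * of the file's blocks.
   */
  boolean complete(String src, String clientName, ExtendedBlock last,
      boolean force) throws IOException;
}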

          Colin Patrick McCabe added a comment -

          I don't think adding a new RPC would be too bad. It would be very similar to recoverLease.

          > But it may be difficult to handle the "suppose you have two threads, T1 and T2; they both have a client name of C" case, since the client is the same.

          I think we should do this in HDFS-4688 rather than trying to solve it here.

          Vinayakumar B added a comment -

          > It seems to me like it would be better to call completeFile() or perhaps some new abortFile() RPC, which would first verify that the client name trying to abort the lease is the same as the current lease holder.

          This looks good, though it seems it would take a lot of code changes, with a lot of cases to handle. But it may be difficult to handle the "suppose you have two threads, T1 and T2; they both have a client name of C" case, since the client is the same.

          Colin Patrick McCabe added a comment -

          OK, I thought about this a little more. Since we handle symlinks by throwing UnresolvedLinkException, maybe the scenario I outlined can't happen. The client would get UnresolvedLinkException when trying to create /baz/bar, and then resolve it to /foo/bar. At that point, we could reasonably detect that it was the same file as the one in the first thread.

          We might reasonably be able to just do something very similar to recoverLease, but checking that the client name is the same as the one on the lease.

          Colin Patrick McCabe added a comment -

          The problem with calling completeFile is that it may never succeed. If the last block can't be replicated adequately, completeFile will return false forever. I had a change previously which at first called completeFile, but then switched to recoverLease after a few tries. But it seemed like such a corner case of a corner case that it wasn't worth doing.

          I agree that there are some thorny issues surrounding leases and multiple clients. I looked at this for a long time and concluded that it's impossible to solve these problems without switching the lease mechanism to use (our globally unique) inode numbers.

          One example is: suppose you have two threads, T1 and T2. They both have a client name of C.

          T1 creates a file /foo/bar, writes some stuff, and tries to close. But the close fails and the file becomes a zombie.

          Some time later, T2 creates /baz/bar, where /baz is a symlink to /foo. So now the NameNode recovers the lease. But will the zombie recovery thread stomp on T2? It definitely might.

          The problem is that a close attempt needs to be associated with a particular file creation attempt. Right now, all we have is a path and a client name, and these aren't enough to uniquely identify the file creation. Your point is that we should be stricter in matching the client name in create with the client name in completeFile/recoverLease. But even being stricter there won't close all the holes.

          Maybe a good compromise in the meantime is to basically expose recoverLeaseInternal(force=false), by adding an optional boolean parameter to the recoverLease protobuf. In the long term, we eventually need a more extensive rework of the leases to be inode-based, which would fix a lot of other sore spots as well (like the rename-of-open-files issue).

          Todd Lipcon added a comment -

          I don't think recoverLease is the right API here; here's an example where it could cause problems:

          • Process A is writing /file, and loses its network connection right before calling close(). Thus it gets registered as a zombie.
          • Process B calls append() on the file after the soft lease has expired. This allows B to keep appending where A left off.
          • Process A recovers its network. The recoverLease() call will then kick process B out from writing.

          Given that these RPCs are also pathname-based, it could even kick a writer off of a new file that just happened to share the file path.

          It seems to me like it would be better to call completeFile() or perhaps some new abortFile() RPC, which would first verify that the client name trying to abort the lease is the same as the current lease holder.

          Colin Patrick McCabe added a comment -

          It looks like there are a few more unit tests that need to be fixed. I have some fixes, will post later today.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12597800/HDFS-4504.011.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The following test timeouts occurred in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.TestFileCreationDelete
          org.apache.hadoop.hdfs.TestHFlush
          org.apache.hadoop.hdfs.TestFileCreationClient

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4814//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4814//console

          This message is automatically generated.

          Colin Patrick McCabe added a comment -
          • rename DFSClient#endFileLease to DFSClient#stopRenewingFileLease
          • add DFSOutputStream#closeCalled as a separate boolean from DFSOutputStream#closed. This ensures that the first time a user calls DFSOutputStream#close, we actually call completeFile (sketched after this list).
          • Call completeFile before stopping renewing the file lease, not after.
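
          A hedged sketch of what the closeCalled guard might look like (illustrative, not the committed patch; the helper at the end is a made-up name standing in for the completeFile and lease bookkeeping):

            private boolean closeCalled = false;

            @Override
            public synchronized void close() throws IOException {
              if (closeCalled) {
                if (lastException != null) {
                  throw lastException; // rethrow the failure from the first attempt
                }
                return; // already closed cleanly
              }
              closeCalled = true; // first user-initiated close: really try to complete
              completeFileAndStopLeaseRenewal(); // hypothetical helper
            }
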
          Colin Patrick McCabe added a comment -

          You're right-- we're going to need a solution for ensuring that completeFile is called when the stream closes itself due to IOException.

          Vinayakumar B added a comment -

          In some cases, DFSOutputStream#closed and DFSOutputStream#lastException will be set by the DataStreamer, prior to DFSOutputStream#close being called. In those cases, we need to throw an exception from close prior to clearing the exception.

          I assume these cases were never handled. Without handling the pipeline failure cases, this patch will be incomplete.
          Pipeline failures while writing data are also among the most likely failures to occur.

          In the case of pipeline failures, closed will be marked true by the DataStreamer thread itself (as already mentioned in Colin Patrick McCabe's comment). The first call to close() will throw the pipeline failure exception, but subsequent calls to close() just return. So the stream will never be marked as a zombie, and its resources will never be released.

          You can verify by changing your test testCloseWithDatanodeDown as follows

          +      out.write(100);
          +      cluster.stopDataNode(0);

          to

          +      out.write(100);
          +      out.hflush();
          +      out.write(100);
          +      cluster.stopDataNode(0);

          Please check.

          Colin Patrick McCabe added a comment -

          TreeMaps are memory inefficient, so I'd prefer to use something else unless we need ordering. I believe ConcurrentHashMap has safe concurrent iteration behavior.

          A balanced tree should take up effectively no memory when it's empty, which this tree will be almost all of the time. In contrast, a hash table needs to pre-reserve a chunk of memory to avoid collisions. Minimum capacity for java.util.HashMap seems to be 16, which means you'll have to spend 16 * sizeof(pointer to jobject) = 128 bytes just to do nothing (which is what you'll be doing most of the time). Also, we do need ordering here. Finally, iterating over hash maps is slow because you have to check all the null entries.

          A hash map would be a really terrible thing to use here, because any growth would be a transitory event, and the map would never shrink again afterwards; it would just continue using that memory forever. Check HashMap.java if you are curious.

          Any reason for the StreamInfo class over a Long?

          StreamInfo is mutable; Long is not. Although I could remove and re-insert the element into the map, it seems unnecessary.

          Unrelated to this patch, but DFSClient#endFileLease is poorly named, since it doesn't actually end the lease. Maybe rename it to stopLeaseRenewal?

          OK, I'll change it to stopLeaseRenewal.

          TestHdfsClose: should use GenericTestUtils#assertExceptionContains when catching an expected IOException.

          It's clearly going to throw some IOException, because the DataNode/NameNode/etc are down, but I don't think the message inside the IOException is important.

          I noticed that endFileLease is now called before completeFile, could the lease expire between these two calls? I guess it'll still get cleaned up okay by the ZSM, but this is extra work.

          I suppose I might as well wait until after completeFile to end the lease renewal. As you've already inferred, it's extremely unlikely to make a difference, since if there are absolutely any other files open by the client, the lease will still be renewed.

          Andrew Wang added a comment -

          Some light review comments; I'd prefer it if someone more experienced with the write path also reviewed this before committing.

          +      // We couldn't contact the NameNode (or it refused to close the file.)
          

          Nit: period goes outside the parenthesis.

          • TreeMaps are memory inefficient, so I'd prefer to use something else unless we need ordering. I believe ConcurrentHashMap has safe concurrent iteration behavior.
          • Any reason for the StreamInfo class over a Long?
          • TestHdfsClose: should use GenericTestUtils#assertExceptionContains when catching an expected IOException.
          • Unrelated to this patch, but DFSClient#endFileLease is poorly named, since it doesn't actually end the lease. Maybe rename it to stopLeaseRenewal?
          • I noticed that endFileLease is now called before completeFile, could the lease expire between these two calls? I guess it'll still get cleaned up okay by the ZSM, but this is extra work.
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12597529/HDFS-4504.010.patch
          against trunk revision .

          -1 patch. Trunk compilation may be broken.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4803//console

          This message is automatically generated.

          Colin Patrick McCabe added a comment -
          • TestDFSClientRetries#testNotYetReplicatedErrors was assuming that close didn't actually call complete() on the NameNode when there was an I/O error. I fixed this assumption by adding a complete() method that returns true on the mock NameNode object used in the test (see the sketch after this list).
          • DFSOutputStream#close needed to set lastException to null after successfully stopping the I/O threads. The reason is that any pending IOException has already been thrown at that point, so there is no need to rethrow it the next time DFSOutputStream#close is called. Only when complete() fails do we want to keep the exception around.
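
          A hypothetical sketch of the kind of stubbing described above, assuming Mockito as used elsewhere in the HDFS tests; the exact complete() signature and matchers are illustrative:

            // Make the mock NameNode report a successful close, so that
            // DFSOutputStream#close can finish even after a streamer failure.
            ClientProtocol mockNN = Mockito.mock(ClientProtocol.class);
            Mockito.doReturn(true).when(mockNN).complete(
                Mockito.anyString(), Mockito.anyString(),
                Mockito.<ExtendedBlock>any());
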
          Jason Lowe added a comment -

          There was a build failure in the test build because TestDFSClientRetries either timed out or called System.exit, but Jenkins missed it due to HADOOP-9583.

          Vinayakumar B added a comment -

          DFSOutputStream#close can throw an IOException in some cases. One example is if there is a pipeline error and then pipeline recovery fails. Unfortunately, in this case, some of the resources used by the DFSOutputStream are leaked. One particularly important resource is file leases.

          According to the description, this patch is supposed to handle the pipeline recovery case, but it mostly misses that case.

          I am not sure how this is passing in Jenkins; TestHdfsClose.testCloseWithDatanodeDown() always fails for me. The last out.close() in the test always throws an error, and according to the code that is exactly how it behaves.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12597191/HDFS-4504.009.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4797//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4797//console

          This message is automatically generated.

          Colin Patrick McCabe added a comment -

          In some cases, DFSOutputStream#closed and DFSOutputStream#lastException will be set by the DataStreamer, prior to DFSOutputStream#close being called. In those cases, we need to throw an exception from close prior to clearing the exception.

          Colin Patrick McCabe added a comment -

          The TestBalancerWithNodeGroup test timeout is HDFS-4376.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12597009/HDFS-4504.008.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.TestFileCreation
          org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4789//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4789//console

          This message is automatically generated.

          Colin Patrick McCabe added a comment -
          • fix findbugs warning about equals() needing to check for null. Suppress bogus warnings about serializing the ZombieStreamManager
          • fix two tests that were assuming that DFSOutputStream#close did not actually call completeFile if the flush failed. In both cases, the test just wanted to test that (h)flush failed, so do that instead.
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12596972/HDFS-4504.007.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 3 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.TestCrcCorruption
          org.apache.hadoop.hdfs.server.datanode.TestBlockRecovery

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4787//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/4787//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4787//console

          This message is automatically generated.

          Colin Patrick McCabe added a comment -

          This patch creates a background thread for handling uncloseable files. Streams get placed into the ZombieStreamManager when close() fails to contact the NameNode. It uses an ExecutorService, so the OS thread will be properly disposed of when it's not in use.

          The client can figure out when the file is closed on the NameNode by polling DFSOutputStream#close. When the lease recovery succeeds, DFSOutputStream#close will stop throwing an IOException. At that point, the client can re-open that file if it wishes. This is a lot better than the current situation, where the client doesn't know when, or if, the file will ever be safe to re-open.

          TestHdfsClose tests a few different cases: all of the DataNodes going down, all of the NameNodes going down, and the client calling DistributedFileSystem#abort. In every case, we should be able to keep going after an error and not run into uncloseable files.
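
          The polling pattern described above might look like the following minimal sketch (the retry interval is arbitrary, and a real caller would bound the loop):

            static void closeWhenPossible(DFSOutputStream out)
                throws InterruptedException {
              while (true) {
                try {
                  out.close(); // stops throwing once the NameNode completes the file
                  return;
                } catch (IOException e) {
                  Thread.sleep(1000); // the background zombie thread keeps retrying
                }
              }
            }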

          Colin Patrick McCabe added a comment -

          That's an interesting idea-- calling recoverLease from the client itself. It might have the advantage of requiring less new code, compared to adding a new flag to complete().

          Uma Maheswara Rao G added a comment -

          Thanks Colin, for working on this issue.
          Just to summarize:
          Per my understanding, there are two issues here: 1. leaving stale references when there are failures in the close call; 2. for a long-lived client, if completeFile fails, no one will recover the file, since the client will keep renewing the lease.

          For #1, the fix would be fairly straightforward.
          For #2, Kihwal brought up some cases above.

          •Extend complete() by adding an optional boolean arg, "force". Things will stay compatible. If a new client is talking to an old NN, the file may not get completed right away, but this is no worse than current behavior. The client (lease renewer) can keep trying periodically. Probably less often than the lease renewal. We may only allow this when lastBlock is present, since the acked block length will reduce the risk of truncating valid data.

          Since the current close call already closes the streamer, where would we maintain this last block? Do you mean we would introduce another structure for it and check it periodically in the renewer/another thread?

          (or) How about checking the filesBeingWritten file state? If a file is closed from the client's perspective but completeFile/flushBuffer failed, we would not remove its reference straight away from the DFSClient. In this case, the renewer would check such (closed) files and look up the real file status from the NN. If the file is closed on the NN (isFileClosed was added in trunk, I guess), then remove it from the filesBeingWritten list directly. Otherwise, make the recoverLease call ourselves (as we know no one else is going to do the recovery for such files).

          Colin Patrick McCabe added a comment -

          I think the vast majority of these cases will simply be handled by block recovery. The other part of the time, block recovery has gotten into a state where it will never succeed, and we simply need to deal with that situation. Probably the best way is adding the force flag to completeFile.

          Kihwal Lee added a comment -

          If the datanodes didn't get the last packet of the last block, or they died before reporting the completion to the NN, completeFile() may never succeed. I can think of several ways to handle this.

          • Delete the incomplete file. It will remove the lease. This will violate the data durability semantics, so it's not feasible to do it in DFSClient. Apps may do this if close() throws an exception.
          • Introduce a new ClientProtocol method, releaseLease(), which triggers immediate block recovery if necessary. This is an incompatible change, so less desirable.
          • Extend complete() by adding an optional boolean arg, "force". Things will stay compatible. If a new client is talking to an old NN, the file may not get completed right away, but this is no worse than current behavior. The client (lease renewer) can keep trying periodically. Probably less often than the lease renewal. We may only allow this when lastBlock is present, since the acked block length will reduce the risk of truncating valid data. (A rough signature sketch follows below.)
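
          A rough sketch of what the third option might look like on ClientProtocol (the signature is illustrative only; a real change would also need a compatible optional protobuf field):

            /**
             * Hypothetical extension of the existing complete() RPC. Returns true
             * once the NameNode has closed the file; if "force" is set, the
             * NameNode may trigger block recovery rather than waiting for the
             * last block to be adequately replicated.
             */
            boolean complete(String src, String clientName,
                ExtendedBlock last, boolean force) throws IOException;
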
          Colin Patrick McCabe added a comment -

          Does this fully solve the problem, given that leases are per-client, not per-file? ie so long as the long-lived client has any other open files for write, it will keep calling renewLease() and the file will be stuck open and un-recovered forever.

          Thanks for pointing that out. I think this is a real problem with my current patch and is likely to lead to the kind of scenario we've seen in the field in the past, where a long-lived HDFS client program like Flume gets some files in limbo after transient network problems during a close operation.

          The only way around that I can think of is to keep around a list of uncompleted, but closed files. The lease renewer thread can call complete on them prior to renewing the lease with the NameNode.
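
          A hedged sketch of that renewer-side idea (the ZombieFile type and field names are illustrative, not HDFS APIs):

            // Before renewing the lease, retry each close that failed earlier.
            Iterator<ZombieFile> it = zombieFiles.iterator();
            while (it.hasNext()) {
              ZombieFile z = it.next();
              if (namenode.complete(z.src, clientName, z.lastBlock)) {
                it.remove(); // the file is finally closed; stop tracking it
              }
            }
            namenode.renewLease(clientName); // then renew as usual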

          Todd Lipcon added a comment -

          Does this fully solve the problem, given that leases are per-client, not per-file? ie so long as the long-lived client has any other open files for write, it will keep calling renewLease() and the file will be stuck open and un-recovered forever.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12579431/HDFS-4504.002.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4275//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4275//console

          This message is automatically generated.

          Colin Patrick McCabe added a comment -
          • remove manual retry (allow RPC retry to do this job)
          • remove debug messages
          • TestBlockRecovery: set lower RPC timeout so that we give up on close prior to the test timeout.
          • If both completeFile and the block data flush fail, throw a MultipleIOException with information about both failures.
          Colin Patrick McCabe added a comment -

          RPC retry should be done at the RPC level. Please don't add retry to DFSClient.

          OK. I will remove this in the next patch.

          I suppose lease recovery will take care of un-completed files eventually.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12579233/HDFS-4504.001.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup
          org.apache.hadoop.hdfs.server.datanode.TestBlockRecovery
          org.apache.hadoop.hdfs.TestHFlush
          org.apache.hadoop.hdfs.TestFileCreation
          org.apache.hadoop.hdfs.TestFileAppend4
          org.apache.hadoop.hdfs.TestQuota
          org.apache.hadoop.hdfs.server.namenode.TestSaveNamespace
          org.apache.hadoop.cli.TestHDFSCLI
          org.apache.hadoop.hdfs.TestDFSClientRetries

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4272//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/4272//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4272//console

          This message is automatically generated.

          Tsz Wo Nicholas Sze added a comment -

          > We should retry the complete RPC periodically in case we are going through e.g. a NameNode failover, or just general network problems.

          RPC retry should be done at the RPC level. Please don't add retry to DFSClient.
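
          Illustrative only: the kind of RPC-level retry being suggested, using hadoop-common's retry framework (the policy parameters are arbitrary):

            RetryPolicy policy = RetryPolicies.retryUpToMaximumCountWithFixedSleep(
                5, 1, TimeUnit.SECONDS); // 5 retries, 1 second apart
            ClientProtocol retryingProxy = (ClientProtocol) RetryProxy.create(
                ClientProtocol.class, rpcProxy, policy);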

          Colin Patrick McCabe added a comment -

          Another issue we could hit here is that if we temporarily can't talk to the NameNode, we could get an IOException when calling DFSOutputStream#completeFile. This is another case where we will "leak" the lease.

          We should retry the complete RPC periodically in case we are going through e.g. a NameNode failover, or just general network problems.

          Colin Patrick McCabe added a comment -

          Here is the code for DFSOutputStream#close:

            /**
             * Closes this output stream and releases any system 
             * resources associated with this stream.
             */
            @Override
            public synchronized void close() throws IOException {
              if (closed) {
                IOException e = lastException;
                if (e == null)
                  return;
                else
                  throw e;
              }
          
              try {
                flushBuffer();       // flush from all upper layers
          
                if (currentPacket != null) { 
                  waitAndQueueCurrentPacket();
                }
          
                if (bytesCurBlock != 0) {
                  // send an empty packet to mark the end of the block
                  currentPacket = new Packet(0, 0, bytesCurBlock, 
                      currentSeqno++, this.checksum.getChecksumSize());
                  currentPacket.lastPacketInBlock = true;
                  currentPacket.syncBlock = shouldSyncBlock;
                }
          
                flushInternal();             // flush all data to Datanodes
                // get last block before destroying the streamer
                ExtendedBlock lastBlock = streamer.getBlock();
                closeThreads(false);
                completeFile(lastBlock);
                dfsClient.endFileLease(src);
              } finally {
                closed = true;
              }
            }
          

          As you can see, if DFSOutputStream#flushBuffer throws an IOException, DFSOutputStream#closed will be set to true, but we'll never call dfsClient.endFileLease. Future attempts to close the DFSOutputStream will just hit the if statement which rethrows lastException.

          The file can never be closed until we take down the whole client-- which is a problem for long-lived clients like HBase, Flume, etc.


            People

            • Assignee:
              Colin Patrick McCabe
            • Reporter:
              Colin Patrick McCabe
            • Votes:
              0
            • Watchers:
              21