  1. Hadoop HDFS
  2. HDFS-7314

When the DFSClient lease cannot be renewed, abort open-for-write files rather than the entire DFSClient

    Details

      Description

      This happened in a YARN NodeManager scenario, but it could happen to any long-running service that uses a cached instance of DistributedFileSystem.

      1. The active NN was under heavy load and became unavailable for 10 minutes; any DFSClient request got a ConnectTimeoutException.

      2. The YARN NodeManager uses DFSClient for certain write operations, such as log aggregation or the shared cache in YARN-1492. The renewLease RPC of the DFSClient used by the YARN NM got a ConnectTimeoutException.

      2014-10-29 01:36:19,559 WARN org.apache.hadoop.hdfs.LeaseRenewer: Failed to renew lease for [DFSClient_NONMAPREDUCE_-550838118_1] for 372 seconds.  Aborting ...
      

      3. After DFSClient is in Aborted state, YARN NM can't use that cached instance of DistributedFileSystem.

      2014-10-29 20:26:23,991 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc...
      java.io.IOException: Filesystem closed
              at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:727)
              at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1780)
              at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124)
              at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
              at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
              at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
              at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:237)
              at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:340)
              at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:57)
              at java.util.concurrent.FutureTask.run(FutureTask.java:262)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
              at java.util.concurrent.FutureTask.run(FutureTask.java:262)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
              at java.lang.Thread.run(Thread.java:745)
      

      We can make YARN or DFSClient more tolerant of temporary NN unavailability. Given that the call stack is YARN -> DistributedFileSystem -> DFSClient, this can be addressed at different layers.

      • YARN closes the DistributedFileSystem object when it receives some well-defined exception. The next HDFS call then creates a new instance of DistributedFileSystem. We would have to fix all the places in YARN, and other HDFS applications would need to address this as well.
      • DistributedFileSystem detects the Aborted DFSClient and creates a new instance of DFSClient. We would need to fix all the places where DistributedFileSystem calls DFSClient.
      • After DFSClient gets into the Aborted state, it doesn't have to reject all requests; instead it can retry. If the NN is available again, it can transition back to a healthy state.
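A minimal sketch of the first two options, assuming the only signal available to the caller is the "Filesystem closed" message seen in the stack trace above. ToyFs, ClosedToyFs, HealthyToyFs and ReopeningFs are hypothetical names for illustration and are not Hadoop types; the real DistributedFileSystem and DFSClient are not used here.

```java
import java.io.IOException;
import java.util.function.Supplier;

/** Hypothetical stand-in for a cached filesystem handle; not a Hadoop type. */
interface ToyFs {
  String getFileStatus(String path) throws IOException;
}

/** Hypothetical handle that fails every call the way an aborted client does. */
class ClosedToyFs implements ToyFs {
  @Override
  public String getFileStatus(String path) throws IOException {
    throw new IOException("Filesystem closed");
  }
}

/** Hypothetical healthy handle. */
class HealthyToyFs implements ToyFs {
  @Override
  public String getFileStatus(String path) {
    return "status:" + path;
  }
}

/** Replaces the cached handle once when it reports "Filesystem closed". */
class ReopeningFs implements ToyFs {
  private final Supplier<ToyFs> factory;
  private ToyFs delegate;
  private int reopenCount = 0;

  ReopeningFs(ToyFs initial, Supplier<ToyFs> factory) {
    this.delegate = initial;
    this.factory = factory;
  }

  @Override
  public synchronized String getFileStatus(String path) throws IOException {
    try {
      return delegate.getFileStatus(path);
    } catch (IOException e) {
      // Assumption: the message is the only sign that the handle is dead.
      if (!"Filesystem closed".equals(e.getMessage())) {
        throw e;
      }
      delegate = factory.get(); // create a fresh handle
      reopenCount++;
      return delegate.getFileStatus(path); // retry exactly once
    }
  }

  synchronized int reopenCount() {
    return reopenCount;
  }

  public static void main(String[] args) throws IOException {
    ReopeningFs fs = new ReopeningFs(new ClosedToyFs(), HealthyToyFs::new);
    System.out.println(fs.getFileStatus("/tmp/rsrc"));
    System.out.println("reopened " + fs.reopenCount() + " time(s)");
  }
}
```

The drawback named in the bullets shows up directly: every call site (here only getFileStatus) needs the same wrapping, which is why the third option, fixing the behavior inside the client, is attractive.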

      Comments?

      1. HDFS-7314.patch (9 kB, Ming Ma)
      2. HDFS-7314-2.patch (6 kB, Ming Ma)
      3. HDFS-7314-3.patch (6 kB, Ming Ma)
      4. HDFS-7314-4.patch (9 kB, Ming Ma)
      5. HDFS-7314-5.patch (7 kB, Ming Ma)
      6. HDFS-7314-6.patch (7 kB, Ming Ma)
      7. HDFS-7314-7.patch (8 kB, Ming Ma)
      8. HDFS-7314-8.patch (6 kB, Ming Ma)
      9. HDFS-7314-9.patch (6 kB, Ming Ma)
      10. HDFS-7314-branch-2.7.2.txt (7 kB, Vinod Kumar Vavilapalli)

        Activity

        kihwal Kihwal Lee added a comment -

        I think DFSClient#abort() can be changed, so that only the existing output streams are aborted. The underlying IPC client can try to reopen connection later.

        cmccabe Colin P. McCabe added a comment -

        I think DFSClient#abort() can be changed, so that only the existing output streams are aborted. The underlying IPC client can try to reopen connection later.

        Good idea. The only thing to watch out for is that some unit tests might be using abort and expecting the current semantics. Perhaps we can create a new function, abortOpenStreams?

        mingma Ming Ma added a comment -

        Thanks Kihwal Lee and Colin P. McCabe for the good suggestions.

        Here is the initial patch that changes the behavior of DFSClient's abort. There might be scenarios that prefer the current behavior, so it is configurable. Unit test results look good, so we don't have to define a new abortOutputStream function. To make sure it works when the application tries to create files while the LeaseRenewer thread is aborting, the LeaseRenewer thread no longer exits when it receives a SocketTimeoutException; otherwise, it is possible that no thread will handle lease renewal for the newly created files.

        The patch also fixes the incorrect log message and adds some helper functions to LeaseRenewer to help with unit tests.

        hadoopqa Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12679168/HDFS-7314.patch
        against trunk revision 2bb327e.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8638//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8638//console

        This message is automatically generated.

        cmccabe Colin P. McCabe added a comment -

        Thanks, Ming Ma. It's interesting that all the unit tests pass with the changed behavior of DFSClient#abort.

        I would prefer not to add this new configuration key, because I really can't think of any cases where I'd like to set it to true.

        I think it would be better just to have the lease timeout logic call a function other than DFSClient#abort. Basically create something like DFSClient#abortOpenFiles and have the lease timeout code call this instead of abort. That way we don't get confused about what abort means, but we also have the nice behavior that our client continues to be useful after a lease timeout.
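The distinction between the two calls can be sketched with a toy model. ToyDfsClient and ToyOutputStream are hypothetical stand-ins, not the real Hadoop classes; the sketch only assumes that abort marks the whole client as closed, while the proposed abortOpenFiles touches the open-for-write streams and nothing else.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for an open-for-write stream. */
class ToyOutputStream {
  private boolean aborted = false;

  void abort() {
    aborted = true;
  }

  boolean isAborted() {
    return aborted;
  }
}

/** Hypothetical stand-in for the client; not the real DFSClient. */
class ToyDfsClient {
  private boolean clientRunning = true;
  private final Map<String, ToyOutputStream> filesBeingWritten = new HashMap<>();

  synchronized ToyOutputStream create(String path) throws IOException {
    checkOpen();
    ToyOutputStream out = new ToyOutputStream();
    filesBeingWritten.put(path, out);
    return out;
  }

  synchronized void checkOpen() throws IOException {
    if (!clientRunning) {
      throw new IOException("Filesystem closed");
    }
  }

  /** Old behaviour: aborts the streams and makes the client unusable. */
  synchronized void abort() {
    clientRunning = false;
    abortOpenFiles();
  }

  /** Proposed behaviour: aborts the streams, client stays usable. */
  synchronized void abortOpenFiles() {
    for (ToyOutputStream out : filesBeingWritten.values()) {
      out.abort();
    }
    filesBeingWritten.clear();
  }

  synchronized int openFileCount() {
    return filesBeingWritten.size();
  }

  public static void main(String[] args) throws IOException {
    ToyDfsClient client = new ToyDfsClient();
    ToyOutputStream out = client.create("/logs/app.log");
    client.abortOpenFiles();
    System.out.println("stream aborted: " + out.isAborted());
    client.create("/logs/next.log"); // still works after a lease timeout
    System.out.println("open files after reuse: " + client.openFileCount());
  }
}
```

With this split, a cached client survives a lease timeout: only the writers that lost their lease see a failure, and later reads such as getFileInfo no longer hit "Filesystem closed".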

        mingma Ming Ma added a comment -

        Thanks, Colin P. McCabe. I have updated the patch based on your suggestion.

        hadoopqa Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12679296/HDFS-7314-2.patch
        against trunk revision 1eed102.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8642//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8642//console

        This message is automatically generated.

        cmccabe Colin P. McCabe added a comment -

        HDFS-7314-2.patch just seems to rename abort to abortOpenFiles. What I was suggesting was creating a separate function, different from abort, which the LeaseRenewer would call. Actually, looking at it, I wonder if the lease renewer can just call closeAllFilesBeingWritten? I haven't looked at it in detail so maybe there's something else the lease renewer needs to do, but this at least looks like a good start.

        We don't need all this boolean removeFromFactory stuff. getInstance will re-add the DFSClient to the map later if needed.

        mingma Ming Ma added a comment -

        Thanks, Colin. Here are more explanations for the changes. Please let me know your thoughts. Appreciate your input.

        1. abort is only used for this scenario. After we have LeaseRenewer call abortOpenFiles, abort won't be called by any functions.
        2. In addition to having DFSClient call closeAllFilesBeingWritten, LeaseRenewer also needs to remove the DFSClient from its list via dfsclients.remove(dfsc); so that DFSClient doesn't renew the lease when there are no files opened. This is achieved via LeaseRenewer's closeClient.
        3. Whether LeaseRenewer should be removed from the factory when it gets a SocketTimeoutException: given that the LeaseRenewer thread won't exit when it gets a SocketTimeoutException as part of the fix, removing the LeaseRenewer object from the factory could leak the LeaseRenewer thread even though the old LeaseRenewer object isn't used by other objects. In reality, LeaseRenewer won't be removed from the factory inside closeClient, given that isRenewerExpired() will return false. So removeFromFactory is there mostly for the semantics; it is not necessary.

        cmccabe Colin P. McCabe added a comment -

        1. abort is only used for this scenario. After we have LeaseRenewer call abortOpenFiles, abort won't be called by any functions.

        Good point. Let's get rid of DFSClient#abort completely then. We don't need this function any more.

        2. In addition to having DFSClient call closeAllFilesBeingWritten, LeaseRenewer also needs to remove the DFSClient from its list via dfsclients.remove(dfsc); so that DFSClient doesn't renew the lease when there are no files opened. This is achieved via LeaseRenewer's closeClient.

        When a lease timeout occurs, LeaseRenewer can just call DFSClient#closeAllFilesBeingWritten(abort=true). Then LeaseRenewer can just call LeaseRenewer#closeClient on itself. This avoids the need to modify LeaseRenewer#closeClient.

        @@ -447,16 +453,17 @@ private void run(final int id) throws InterruptedException {
                   lastRenewed = Time.now();
                 } catch (SocketTimeoutException ie) {
                   LOG.warn("Failed to renew lease for " + clientsString() + " for "
        -              + (elapsed/1000) + " seconds.  Aborting ...", ie);
        +              + ((Time.now() - lastRenewed)/1000) + " seconds.  Aborting ...",
        +              ie);
                   synchronized (this) {
                     while (!dfsclients.isEmpty()) {
        

        I don't think we need this change and the other similar change.

        mingma Ming Ma added a comment -

        Thanks, Colin. Here is the updated patch.

        1. It turns out closeClient isn't necessary: when LeaseRenewer has DFSClient close all open files, the last file's call into LeaseRenewer's closeFile removes the DFSClient object. I have added verification of that to the unit tests.
        2. The log message is somewhat misleading. elapsed is measured up to the start of the renewLease RPC call, so the log will say "the lease couldn't be renewed for 30 seconds" even though the RPC retries could take several minutes. We can leave that for another jira.
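The gap between the two reported durations can be illustrated with plain arithmetic. This is a sketch with made-up timestamps, not LeaseRenewer code: RenewLogDurations and its method names are hypothetical, and the 30 second and 5 minute figures are chosen only to match the orders of magnitude mentioned above.

```java
/** Hypothetical illustration of the two ways to report the outage length. */
class RenewLogDurations {

  /** Time since the last renewal, sampled before the renewLease RPC starts. */
  static long elapsedBeforeRpcSeconds(long lastRenewedMs, long rpcStartMs) {
    return (rpcStartMs - lastRenewedMs) / 1000;
  }

  /** Time since the last renewal, sampled when the RPC finally fails. */
  static long sinceLastRenewedSeconds(long lastRenewedMs, long failureMs) {
    return (failureMs - lastRenewedMs) / 1000;
  }

  public static void main(String[] args) {
    long lastRenewedMs = 0L;      // last successful renewal (made-up value)
    long rpcStartMs = 30_000L;    // renewer wakes up 30 s later and starts the RPC
    long failureMs = 330_000L;    // RPC retries for 5 more minutes, then times out

    System.out.println("sampled before the RPC: "
        + elapsedBeforeRpcSeconds(lastRenewedMs, rpcStartMs) + " seconds");
    System.out.println("sampled at failure:     "
        + sinceLastRenewedSeconds(lastRenewedMs, failureMs) + " seconds");
  }
}
```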

        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12679945/HDFS-7314-3.patch
        against trunk revision 1670578.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

        org.apache.hadoop.hdfs.TestDFSClientRetries

        The following test timeouts occurred in hadoop-hdfs-project/hadoop-hdfs:

        org.apache.hadoop.hdfs.qjournal.client.TestQuorumJournalManager

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8681//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8681//console

        This message is automatically generated.

        mingma Ming Ma added a comment -

        It turns out this change uncovered a new, unrelated bug.

        If the DataStreamer thread exits and closes the stream before the application closes the stream, DFSClient will keep renewing the lease. That is because DataStreamer's closeInternal marks the stream closed but doesn't call DFSClient's endFileLease. Later, when the application closes the stream, it skips DFSClient's endFileLease because the stream has already been closed.

        So the latest patch also includes a fix for the endFileLease leak and updates the unit test to verify it. We could open a separate jira for that, but without the fix for the endFileLease leak, the patch needs to be modified to work around it.
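The leak described above can be reproduced in a toy model. LeaseTrackingClient and LeakyToyStream are hypothetical classes written for this sketch; they assume only the behavior stated in the comment: the streamer-side close marks the stream closed without ending the lease, and the application-side close returns early when the stream is already closed.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical client that only tracks which files still hold a lease. */
class LeaseTrackingClient {
  private final Set<String> leasedFiles = new HashSet<>();

  void beginFileLease(String path) {
    leasedFiles.add(path);
  }

  void endFileLease(String path) {
    leasedFiles.remove(path);
  }

  boolean renewsLeaseFor(String path) {
    return leasedFiles.contains(path);
  }
}

/** Hypothetical output stream with the two close paths from the comment. */
class LeakyToyStream {
  private final LeaseTrackingClient client;
  private final String path;
  private boolean closed = false;

  LeakyToyStream(LeaseTrackingClient client, String path) {
    this.client = client;
    this.path = path;
    client.beginFileLease(path);
  }

  /** Streamer-side close: marks the stream closed, never ends the lease. */
  void closeInternal() {
    closed = true;
  }

  /** Application-side close: does nothing if the stream is already closed. */
  void close() {
    if (closed) {
      return; // endFileLease is skipped on this path
    }
    closed = true;
    client.endFileLease(path);
  }

  public static void main(String[] args) {
    LeaseTrackingClient client = new LeaseTrackingClient();

    LeakyToyStream normal = new LeakyToyStream(client, "/a");
    normal.close();
    System.out.println("/a still leased: " + client.renewsLeaseFor("/a"));

    LeakyToyStream failed = new LeakyToyStream(client, "/b");
    failed.closeInternal(); // streamer thread exits first
    failed.close();         // application close arrives later
    System.out.println("/b still leased: " + client.renewsLeaseFor("/b"));
  }
}
```

The toy model deliberately leaves out the NameNode side. Calling endFileLease in closeInternal would stop the client-side renewal in this sketch, but it says nothing about whether the NameNode still considers the lease open.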

        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12680087/HDFS-7314-4.patch
        against trunk revision ba0a42c.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The following test timeouts occurred in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

        org.apache.hadoop.hdfs.server.namenode.TestFsck
        org.apache.hadoop.hdfs.server.namenode.TestDeleteRace

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8686//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8686//console

        This message is automatically generated.

        cmccabe Colin P. McCabe added a comment -

        It turns out this change uncovered a new, unrelated bug. If the DataStreamer thread exits and closes the stream before the application closes the stream, DFSClient will keep renewing the lease. That is because DataStreamer's closeInternal marks the stream closed but doesn't call DFSClient's endFileLease. Later, when the application closes the stream, it skips DFSClient's endFileLease because the stream has already been closed.

        You're right that there is a bug here. There is a lot of discussion about what to do about this issue in HDFS-4504. It's not as simple as just calling endFileLease... if we missed calling completeFile, the NN will continue to think that we have a lease open on this file. I think we should avoid modifying DFSOutputStream#close here. We should try to keep this JIRA focused on just the description. Plus HDFS-4504 is a complex issue, not easy to solve.

        TestDFSClientRetries.java: let's get rid of the unnecessary whitespace change in the current patch.

        I like the idea of getting rid of the DFSClient#abort function.

        The patch looks good once these things are removed, should be ready to go soon!

        mingma Ming Ma added a comment -

        Thanks, Colin. Didn't know "lease leak" is a known issue.

        Here is the updated patch. Given the "lease leak" issue, LeaseRenewer can't rely on closeAllFilesBeingWritten to close all leases, so it has to call closeClient.

        testLeaseRenewSocketTimeout added to TestDFSClientRetries doesn't seem to have unnecessary whitespace. Do you mean newline? The updated patch has removed unnecessary newlines.

        hadoopqa Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12680355/HDFS-7314-5.patch
        against trunk revision 4a114dd.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8696//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8696//console

        This message is automatically generated.

        cmccabe Colin P. McCabe added a comment -
        @@ -450,10 +455,11 @@ private void run(final int id) throws InterruptedException {
                       + (elapsed/1000) + " seconds.  Aborting ...", ie);
                   synchronized (this) {
                     while (!dfsclients.isEmpty()) {
        -              dfsclients.get(0).abort();
        +              DFSClient dfsClient = dfsclients.get(0);
        +              dfsClient.closeAllFilesBeingWritten(true);
        +              closeClient(dfsClient);
                     }
                   }
        -          break;
                 } catch (IOException ie) {
                   LOG.warn("Failed to renew lease for " + clientsString() + " for "
                       + (elapsed/1000) + " seconds.  Will retry shortly ...", ie);
        

        It seems like getting rid of "break" here is going to lead to the LeaseRenewer thread for the client continuing to run after the client's lease has been aborted. This doesn't seem like what we want? After all, we are going to create a new LeaseRenewer if the DFSClient opens another file for write.

        mingma Ming Ma added a comment -

        Thanks, Colin. The reason to keep the thread running is to handle the following race condition.

        1. leaseRenewal thread is aborting.
        2. The application creates files before leaseRenewal is removed from the factory. So DFSClient is added to the leaseRenewal object.
        3. leaseRenewal thread exits. So nobody will renew lease for that DFSClient.

        cmccabe Colin P. McCabe added a comment -

        Good catch. This code is certainly somewhat subtle. I think that the currentId variable was intended to address the problem you're describing.

        Keeping the thread running seems strange. Is it going to abort the clients it's tracking more than once? I would rather stop it if at all possible.

        It seems like maybe what we should do here is set emptyTime to 0 and break out of the loop to exit the thread. This will lead to the current LeaseRenewer thread being considered "expired" and not used in LeaseRenewer#put. So there should be no race condition then, because LeaseRenewer#put will create a new thread (and increment currentId) if the current one is expired.
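
The suggestion above can be sketched with a small model. This is a hypothetical, simplified stand-in (the class `ExpiringRenewerModel` and its methods are invented for illustration and are not the real `org.apache.hadoop.hdfs.LeaseRenewer`); it only shows how forcing `emptyTime` to 0 makes the renewer look expired, so that the next `put` starts a fresh thread id instead of reusing the aborted one.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical model of the "set emptyTime to 0 and exit" idea.
 * Time is passed in explicitly so the behaviour is deterministic.
 */
class ExpiringRenewerModel {
  private final long gracePeriodMs;
  private long emptyTime = Long.MAX_VALUE; // MAX_VALUE means "not empty"
  private int currentId = 0;               // id of the newest renewer thread
  private boolean running = false;
  private final List<String> clients = new ArrayList<>();

  ExpiringRenewerModel(long gracePeriodMs) {
    this.gracePeriodMs = gracePeriodMs;
  }

  /** Expired once the renewer has been empty for longer than the grace period. */
  synchronized boolean isExpired(long nowMs) {
    return emptyTime != Long.MAX_VALUE && nowMs - emptyTime > gracePeriodMs;
  }

  /** Registers a client; starts a new "thread" (a new id) if none is usable. */
  synchronized int put(String client, long nowMs) {
    if (!running || isExpired(nowMs)) {
      currentId++;   // an older thread would see a stale id and exit
      running = true;
    }
    clients.add(client);
    emptyTime = Long.MAX_VALUE;
    return currentId;
  }

  /** What the renewal thread would do after failing to renew for too long. */
  synchronized void abortAll() {
    clients.clear();
    emptyTime = 0;   // makes isExpired() true for any realistic "now"
    // the real thread would break out of its loop at this point
  }

  synchronized int clientCount() {
    return clients.size();
  }
}
```

In this model a `put` that arrives after `abortAll` never attaches itself to the aborted thread, which is the property the comment is after.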

        mingma Ming Ma added a comment -

        Thanks, Colin. Keeping the thread running shouldn't abort the same clients more than once. But I agree with you it is better to let the thread go.

        There is another race condition between beginFileLease and the LeaseRenewer aborting the lease.

        1. beginFileLease calls into getLeaseRenewer, which adds the DFSClient to the LeaseRenewer's list.
        2. LeaseRenewer removes all DFSClients upon the socket timeout, including the DFSClient just added.
        3. beginFileLease continues on to call LeaseRenewer's put method. It adds the file to the DFSClient. But given that the DFSClient isn't in the LeaseRenewer's list, its lease won't be renewed.

        The patch also fixes this new scenario by moving addClient into the put method.
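
The three steps above can be replayed deterministically in a toy model. Everything here is an assumption made for illustration (`RenewerRaceModel`, `putFileOnly`, and `willRenew` are invented names, not Hadoop APIs); the point is only that registering the client inside `put`, under the same lock as the file bookkeeping, leaves no window for an abort to slip in between.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of a renewer that tracks clients and their open files. */
class RenewerRaceModel {
  private final List<String> clients = new ArrayList<>();
  private final Map<String, List<Long>> filesByClient = new HashMap<>();

  /** Old shape: the lookup path registers the client as a separate step. */
  synchronized void addClient(String client) {
    if (!clients.contains(client)) {
      clients.add(client);
    }
  }

  /** Old shape: put only records the file and assumes the client is registered. */
  synchronized void putFileOnly(String client, long fileId) {
    filesByClient.computeIfAbsent(client, k -> new ArrayList<>()).add(fileId);
  }

  /** Patched shape: registration and file bookkeeping happen under one lock. */
  synchronized void put(String client, long fileId) {
    addClient(client);
    putFileOnly(client, fileId);
  }

  /** Renewal gave up: drop every tracked client and its files. */
  synchronized void abortAll() {
    clients.clear();
    filesByClient.clear();
  }

  /** True if the renewer would still renew the lease for this client. */
  synchronized boolean willRenew(String client) {
    return clients.contains(client);
  }
}
```

Replaying steps 1 to 3 with `addClient`, `abortAll`, `putFileOnly` leaves a file recorded for a client that `willRenew` reports as untracked, while the single `put` call does not.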

        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12680685/HDFS-7314-6.patch
        against trunk revision 68a0508.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

        org.apache.hadoop.hdfs.TestDistributedFileSystem

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8707//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8707//console

        This message is automatically generated.

        mingma Ming Ma added a comment -

        Updated the unit test TestDistributedFileSystem. The test assumed that the same LeaseRenewer object would be used even after the lease renewal thread expires, because it calls getLeaseRenewer() after the stream is closed.

        Given getLeaseRenewer() no longer calls addClient, the LeaseRenewer object will be released as part of lease renewal thread expiration. Thus the test needs to set the grace period value on the new object.

        hadoopqa Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12680729/HDFS-7314-7.patch
        against trunk revision 58e9bf4.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 2 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8711//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8711//console

        This message is automatically generated.

        cmccabe Colin P. McCabe added a comment -

        I feel like this code is still not quite right. We can get two LeaseRenewer objects now, right?

        1. beginFileLease calls into getLeaseRenewer, gets LeaseRenewer #1
        2. LeaseRenewer#closeClient (for LeaseRenewer #1) removes itself from Factory.INSTANCE.
        3. another thread calls beginFileLease. There is no LeaseRenewer object in Factory.INSTANCE any more, so a new one is created (call it #2).
        4. first thread calls put, adds the DFSClient to LeaseRenewer #1 and LR1 to Factory.INSTANCE
        5. second thread calls put, adds the DFSClient to LeaseRenewer #2 and LR2 to Factory.INSTANCE.

        Won't we end up with two LeaseRenewer objects after this point?

        The problem is basically that if we allow the LeaseRenewer object to escape from LeaseRenewer.java, and we accept that these objects can "die", we have to accept that people can be using dead LeaseRenewer objects.

        I'm not sure what the best way to fix this is... it is kind of a mess. I guess maybe it's a pre-existing problem too? If I'm understanding the situation correctly.
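
A minimal sketch of the "dead object" concern, under the assumption that the factory is a simple get-or-create map that also allows removal. The names `RenewerFactoryModel`, `getOrCreate`, and `current` are hypothetical and only model the shape of the problem, not the real `LeaseRenewer.Factory`.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical get-or-create factory whose entries can be removed ("die"). */
class RenewerFactoryModel {
  static final class Renewer {
    final int serial;
    Renewer(int serial) {
      this.serial = serial;
    }
  }

  private final Map<String, Renewer> byKey = new HashMap<>();
  private int created = 0;

  /** Returns the renewer for the key, creating one if none is registered. */
  synchronized Renewer getOrCreate(String key) {
    Renewer r = byKey.get(key);
    if (r == null) {
      r = new Renewer(++created);
      byKey.put(key, r);
    }
    return r;
  }

  /** Called when a renewer shuts down and unregisters itself. */
  synchronized void remove(String key, Renewer r) {
    if (byKey.get(key) == r) {
      byKey.remove(key);
    }
  }

  /** The renewer currently registered for the key, or null. */
  synchronized Renewer current(String key) {
    return byKey.get(key);
  }
}
```

A caller that obtained a renewer before `remove` ran keeps a reference the factory no longer knows about; the next `getOrCreate` for the same key hands out a different object, so two renewers can be in use for one key.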

        mingma Ming Ma added a comment -

        Thanks Colin for the good point. I also noticed that during the analysis, but assumed it was part of the original design.

        1. The issue you described is in trunk. It can happen when LeaseRenewer goes away, due to SocketTimeoutException or RenewerExpired.
        2. In your above steps, the LeaseRenewer object is added to Factory.INSTANCE in step #3, not in steps #4 and #5. But that doesn't change the issue. What will happen is that when the first thread calls endFileLease, it will get hold of LR2. So LR1 will keep renewing the lease even after all files have been closed.

        It appears we have discovered a bunch of race conditions that exist regardless of whether the original issue is addressed or not. Given that, we can consider fixing the original issue and opening another jira to address these race conditions.

        As you mentioned, the issues come from the fact that LeaseRenewer tries to clean up the object and the thread when they are no longer used. IMHO, that is not necessary; we can just keep LeaseRenewer objects and their threads around once they are created, which was the idea in the original patch. LeaseRenewer objects are keyed by NN address and ugi. In a normal setup, with HDFS federation you can have several NN addresses, but the number of ugis should be limited. So it isn't expensive to keep these objects and their threads around.

        If the long term fix is to keep the LeaseRenewer object and thread around, we can start with the fix for SocketTimeoutException in this patch and open another patch to address the RenewerExpired scenario later.
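
The "keep them around" alternative could look roughly like the sketch below: a registry keyed by NN address and user that never removes entries, so every caller with the same key sees the same object. The class names (`RenewerKey`, `LongLivedRenewer`, `RenewerRegistry`) are assumptions for illustration, and a plain string stands in for the ugi.

```java
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical key: NameNode authority plus user name (standing in for the ugi). */
final class RenewerKey {
  final String nnAuthority;
  final String user;

  RenewerKey(String nnAuthority, String user) {
    this.nnAuthority = nnAuthority;
    this.user = user;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof RenewerKey)) {
      return false;
    }
    RenewerKey other = (RenewerKey) o;
    return nnAuthority.equals(other.nnAuthority) && user.equals(other.user);
  }

  @Override
  public int hashCode() {
    return Objects.hash(nnAuthority, user);
  }
}

/** Hypothetical renewer that lives for the lifetime of the process. */
final class LongLivedRenewer {
  final RenewerKey key;

  LongLivedRenewer(RenewerKey key) {
    this.key = key;
  }
}

/** Registry that creates a renewer once per key and never removes it. */
final class RenewerRegistry {
  private final ConcurrentMap<RenewerKey, LongLivedRenewer> renewers =
      new ConcurrentHashMap<>();

  LongLivedRenewer get(String nnAuthority, String user) {
    // computeIfAbsent is atomic per key, so concurrent callers share one instance
    return renewers.computeIfAbsent(
        new RenewerKey(nnAuthority, user), LongLivedRenewer::new);
  }

  int size() {
    return renewers.size();
  }
}
```

The cost is one idle object (and, in the real design, one thread) per distinct key, which is the trade-off the comment argues is acceptable when the number of NN addresses and ugis is small.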

        cmccabe Colin P. McCabe added a comment -

        I need to think about this more. I think the root of the problem is the decision to expose the LeaseRenewer thread object instances outside LeaseRenewer.java, rather than simply having static methods (or the equivalent) that act on "the current lease manager for your UGI", without forcing you to know or care what that is.

        I am fine with fixing this in another JIRA, but I really feel like we should fix it first. I don't feel good about the current synchronization at all.

        Thanks for your patience, Ming Ma.

        mingma Ming Ma added a comment -

        Thanks, Colin.

        1. There is an existing static method called LeaseRenewer#getInstance. The methods of LeaseRenewer and LeaseRenewer#Factory are synchronized, but only at the level of each class instance. Some of these race conditions come from the lack of synchronization between class instances. We can try to fix those scenarios.

        2. Alternatively, we can get rid of the LeaseRenewer thread/object recycle logic. For short-duration programs like MR job submission, it won't kick in anyway. For long-running services like YARN, it doesn't really matter: they already create several long-running threads, so it should be ok to keep a few LeaseRenewer threads around. In addition, given that these services might use HDFS regularly, LeaseRenewer threads will be recreated or just kept around anyway.

        jira.shegalov Gera Shegalov added a comment -

        I think the real problem is that the FileSystem-level CACHE entry is not invalidated/evicted even though the DFSClient is closed.

        1. DistributedFileSystem#close does not call super.close(), which would achieve this.
        2. DFSClient#abort does not close the wrapping DFS object, nor does DFS try to intercept checkOpen to do this.

        Solving these issues would solve the scenario described in the JIRA. What do you think, Ming Ma?

        hadoopqa Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12680729/HDFS-7314-7.patch
        against trunk revision 5a0051f.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 2 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

        Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/9366//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9366//console

        This message is automatically generated.

        mingma Ming Ma added a comment -

        Thanks, Gera Shegalov. That is interesting. That might work when applications request a new FileSystem object. However, there is the scenario where applications still hold a reference to the aborted FileSystem object and want to use it to create files; would those applications then need to be modified to catch the exception and recreate the FileSystem object? At the beginning of the jira, one of the 3 proposed solutions was to keep DistributedFileSystem alive and recreate DFSClient. Regardless of the approach, it would be good to keep it transparent to the applications.

        jira.shegalov Gera Shegalov added a comment -

        Actually, I need to take #1 back; I misspoke. DFS#close does call super.close():

          @Override
          public void close() throws IOException {
            try {
              dfs.closeOutputStreams(false);
              super.close();
            } finally {
              dfs.close();
            }
          }
        

        So it's only about 2.

        mingma Ming Ma added a comment -

        Here is a slightly different version we have deployed on our production clusters. It doesn't address all the possible race conditions discussed above; but it should take care of the immediate issue.

        The question is whether we should use this jira to address these race conditions systematically. Getting rid of LeaseRenewer expiry is one way to tackle that. We can just keep LeaseRenewer objects and their threads around once they have been created. Thoughts?

        hadoopqa Hadoop QA added a comment -



        -1 overall



        Vote Subsystem Runtime Comment
        -1 pre-patch 17m 38s Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warnings.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 tests included 0m 0s The patch appears to include 1 new or modified test files.
        -1 javac 8m 9s The applied patch generated 1 additional warning messages.
        +1 javadoc 10m 6s There were no new javadoc warning messages.
        +1 release audit 0m 21s The applied patch does not increase the total number of release audit warnings.
        -1 checkstyle 1m 26s The applied patch generated 3 new checkstyle issues (total was 138, now 139).
        +1 whitespace 0m 0s The patch has no lines that end in whitespace.
        +1 install 1m 27s mvn install still works.
        +1 eclipse:eclipse 0m 33s The patch built with eclipse:eclipse.
        +1 findbugs 2m 37s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
        +1 native 3m 10s Pre-build of native portion
        -1 hdfs tests 159m 27s Tests failed in hadoop-hdfs.
            204m 59s  



        Reason Tests
        Failed unit tests hadoop.hdfs.TestLeaseRecovery2



        Subsystem Report/Notes
        Patch URL http://issues.apache.org/jira/secure/attachment/12744319/HDFS-7314-8.patch
        Optional Tests javadoc javac unit findbugs checkstyle
        git revision trunk / 2e3d83f
        Pre-patch Findbugs warnings https://builds.apache.org/job/PreCommit-HDFS-Build/11630/artifact/patchprocess/trunkFindbugsWarningshadoop-hdfs.html
        javac https://builds.apache.org/job/PreCommit-HDFS-Build/11630/artifact/patchprocess/diffJavacWarnings.txt
        checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/11630/artifact/patchprocess/diffcheckstylehadoop-hdfs.txt
        hadoop-hdfs test log https://builds.apache.org/job/PreCommit-HDFS-Build/11630/artifact/patchprocess/testrun_hadoop-hdfs.txt
        Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/11630/testReport/
        Java 1.7.0_55
        uname Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Console output https://builds.apache.org/job/PreCommit-HDFS-Build/11630/console

        This message was automatically generated.

        cmccabe Colin P. McCabe added a comment -

        Thanks for picking this up again, Ming Ma.

                    emptyTime = 0;
        

        Can we have a comment on this line explaining that the purpose of setting this to 0 is to make the renewer seem to be expired?

        +1 once that's done.

        I still feel like the synchronization could use some work here. The thread stopping and starting logic is very complex and I feel that it could be simplified a lot with something like a periodic ExecutorService. But this patch doesn't make it any worse than it currently is, and it fixes some major issues for us.
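
As a rough illustration of the "periodic ExecutorService" idea, the sketch below replaces hand-rolled thread start/stop logic with a `ScheduledExecutorService`. It is not taken from any Hadoop patch; `PeriodicRenewerSketch`, `renewOnce`, and the `demo` helper are hypothetical names, and a real renewer would also need the client and file bookkeeping discussed earlier.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical renewer that delegates scheduling to an executor. */
class PeriodicRenewerSketch {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor(runnable -> {
        Thread t = new Thread(runnable, "lease-renewer-sketch");
        t.setDaemon(true); // do not keep the JVM alive for renewals
        return t;
      });
  private ScheduledFuture<?> task;

  /** Starts periodic renewal unless a renewal task is already active. */
  synchronized void start(Runnable renewOnce, long periodMs) {
    if (task == null || task.isDone()) {
      task = scheduler.scheduleWithFixedDelay(() -> {
        try {
          renewOnce.run();
        } catch (RuntimeException e) {
          // An uncaught exception would cancel all future runs, so swallow it
          // here; a real implementation would log it and decide whether to abort.
        }
      }, 0, periodMs, TimeUnit.MILLISECONDS);
    }
  }

  /** Stops renewing; start() may be called again later. */
  synchronized void stop() {
    if (task != null) {
      task.cancel(false);
    }
  }

  /** Releases the scheduler thread for good. */
  void shutdown() {
    scheduler.shutdownNow();
  }

  /** Runs the sketch until the given number of renewals happened or the timeout hit. */
  static boolean demo(int renewals, long periodMs, long timeoutMs) {
    CountDownLatch latch = new CountDownLatch(renewals);
    PeriodicRenewerSketch renewer = new PeriodicRenewerSketch();
    renewer.start(latch::countDown, periodMs);
    try {
      return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      renewer.shutdown();
    }
  }
}
```

With this shape, "is the thread running or expired" reduces to whether a scheduled task is active, which is one way the start/stop logic might be simplified.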

        mingma Ming Ma added a comment -

        Thanks, Colin P. McCabe. Here is the updated patch with your suggestion. I will open a new JIRA to continue the discussion about refactoring lease renewal to address these race conditions systematically.

        hadoopqa Hadoop QA added a comment -



        -1 overall



        Vote  Subsystem        Runtime   Comment
        0     pre-patch        17m 4s    Pre-patch trunk compilation is healthy.
        +1    @author          0m 0s     The patch does not contain any @author tags.
        +1    tests included   0m 0s     The patch appears to include 1 new or modified test files.
        -1    javac            7m 33s    The applied patch generated 1 additional warning messages.
        +1    javadoc          9m 42s    There were no new javadoc warning messages.
        +1    release audit    0m 23s    The applied patch does not increase the total number of release audit warnings.
        -1    checkstyle       1m 20s    The applied patch generated 2 new checkstyle issues (total was 137, now 137).
        +1    whitespace       0m 1s     The patch has no lines that end in whitespace.
        +1    install          1m 20s    mvn install still works.
        +1    eclipse:eclipse  0m 32s    The patch built with eclipse:eclipse.
        +1    findbugs         2m 32s    The patch does not introduce any new Findbugs (version 3.0.0) warnings.
        +1    native           3m 3s     Pre-build of native portion
        -1    hdfs tests       160m 17s  Tests failed in hadoop-hdfs.
                               203m 50s



        Reason             Tests
        Failed unit tests  hadoop.hdfs.TestRollingUpgrade
                           hadoop.hdfs.server.namenode.ha.TestStandbyIsHot



        Subsystem             Report/Notes
        Patch URL             http://issues.apache.org/jira/secure/attachment/12745179/HDFS-7314-9.patch
        Optional Tests        javadoc javac unit findbugs checkstyle
        git revision          trunk / a431ed9
        javac                 https://builds.apache.org/job/PreCommit-HDFS-Build/11692/artifact/patchprocess/diffJavacWarnings.txt
        checkstyle            https://builds.apache.org/job/PreCommit-HDFS-Build/11692/artifact/patchprocess/diffcheckstylehadoop-hdfs.txt
        hadoop-hdfs test log  https://builds.apache.org/job/PreCommit-HDFS-Build/11692/artifact/patchprocess/testrun_hadoop-hdfs.txt
        Test Results          https://builds.apache.org/job/PreCommit-HDFS-Build/11692/testReport/
        Java                  1.7.0_55
        uname                 Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Console output        https://builds.apache.org/job/PreCommit-HDFS-Build/11692/console

        This message was automatically generated.

        cmccabe Colin P. McCabe added a comment -

        Thanks, Ming Ma. +1.

        mingma Ming Ma added a comment -

        I have committed the patch to trunk and branch-2. Thanks to Colin P. McCabe, Gera Shegalov, Lohit Vijayarenu, and Kihwal Lee for the reviews and suggestions.

        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-trunk-Commit #8174 (See https://builds.apache.org/job/Hadoop-trunk-Commit/8174/)
        HDFS-7314. When the DFSClient lease cannot be renewed, abort open-for-write files rather than the entire DFSClient. (mingma) (mingma: rev fbd88f1062f3c4b208724d208e3f501eb196dfab)

        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/client/impl/LeaseRenewer.java
        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java
        • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSClientRetries.java
        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        Move HDFS-7314 to 2.8 section in CHANGES.txt (mingma: rev 0bda84fd48681ac1748a4770cff2f23e8336d276)
        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #247 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/247/)
        HDFS-7314. When the DFSClient lease cannot be renewed, abort open-for-write files rather than the entire DFSClient. (mingma) (mingma: rev fbd88f1062f3c4b208724d208e3f501eb196dfab)

        • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSClientRetries.java
        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java
        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/client/impl/LeaseRenewer.java
        Move HDFS-7314 to 2.8 section in CHANGES.txt (mingma: rev 0bda84fd48681ac1748a4770cff2f23e8336d276)
        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #259 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/259/)
        HDFS-7314. When the DFSClient lease cannot be renewed, abort open-for-write files rather than the entire DFSClient. (mingma) (mingma: rev fbd88f1062f3c4b208724d208e3f501eb196dfab)

        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java
        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/client/impl/LeaseRenewer.java
        • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSClientRetries.java
        Move HDFS-7314 to 2.8 section in CHANGES.txt (mingma: rev 0bda84fd48681ac1748a4770cff2f23e8336d276)
        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Yarn-trunk #989 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/989/)
        HDFS-7314. When the DFSClient lease cannot be renewed, abort open-for-write files rather than the entire DFSClient. (mingma) (mingma: rev fbd88f1062f3c4b208724d208e3f501eb196dfab)

        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/client/impl/LeaseRenewer.java
        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSClientRetries.java
        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java
        Move HDFS-7314 to 2.8 section in CHANGES.txt (mingma: rev 0bda84fd48681ac1748a4770cff2f23e8336d276)
        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #256 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/256/)
        HDFS-7314. When the DFSClient lease cannot be renewed, abort open-for-write files rather than the entire DFSClient. (mingma) (mingma: rev fbd88f1062f3c4b208724d208e3f501eb196dfab)

        • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSClientRetries.java
        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java
        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/client/impl/LeaseRenewer.java
        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        Move HDFS-7314 to 2.8 section in CHANGES.txt (mingma: rev 0bda84fd48681ac1748a4770cff2f23e8336d276)
        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Hdfs-trunk #2186 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2186/)
        HDFS-7314. When the DFSClient lease cannot be renewed, abort open-for-write files rather than the entire DFSClient. (mingma) (mingma: rev fbd88f1062f3c4b208724d208e3f501eb196dfab)

        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSClientRetries.java
        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/client/impl/LeaseRenewer.java
        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java
        Move HDFS-7314 to 2.8 section in CHANGES.txt (mingma: rev 0bda84fd48681ac1748a4770cff2f23e8336d276)
        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Mapreduce-trunk #2205 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2205/)
        HDFS-7314. When the DFSClient lease cannot be renewed, abort open-for-write files rather than the entire DFSClient. (mingma) (mingma: rev fbd88f1062f3c4b208724d208e3f501eb196dfab)

        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSClientRetries.java
        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/client/impl/LeaseRenewer.java
        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java
        Move HDFS-7314 to 2.8 section in CHANGES.txt (mingma: rev 0bda84fd48681ac1748a4770cff2f23e8336d276)
        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        vinodkv Vinod Kumar Vavilapalli added a comment -

        This has the 2.6.1-candidate label, but it seems like it is already pulled into 2.6.1.

        Ran compilation and TestDFSClientRetries anyway, just to be sure.

        vinodkv Vinod Kumar Vavilapalli added a comment -

        Just pulled this into branch-2.7 (release 2.7.2) as it already exists in 2.6.1.

        The branch-2 patch had merge conflicts. Ran compilation and TestDFSClientRetries before the push.

        vinodkv Vinod Kumar Vavilapalli added a comment -

        Attaching the patch that I committed to 2.7.2.

        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-trunk-Commit #8432 (See https://builds.apache.org/job/Hadoop-trunk-Commit/8432/)
        HDFS-7314. Moving to 2.6.1 CHANGES.txt section. (vinodkv: rev f103a70af5c5b01931b5cd2e5782eac5aeeb31cd)

        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Yarn-trunk #1108 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/1108/)
        HDFS-7314. Moving to 2.6.1 CHANGES.txt section. (vinodkv: rev f103a70af5c5b01931b5cd2e5782eac5aeeb31cd)

        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #370 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/370/)
        HDFS-7314. Moving to 2.6.1 CHANGES.txt section. (vinodkv: rev f103a70af5c5b01931b5cd2e5782eac5aeeb31cd)

        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Mapreduce-trunk #2318 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2318/)
        HDFS-7314. Moving to 2.6.1 CHANGES.txt section. (vinodkv: rev f103a70af5c5b01931b5cd2e5782eac5aeeb31cd)

        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #376 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/376/)
        HDFS-7314. Moving to 2.6.1 CHANGES.txt section. (vinodkv: rev f103a70af5c5b01931b5cd2e5782eac5aeeb31cd)

        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #356 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/356/)
        HDFS-7314. Moving to 2.6.1 CHANGES.txt section. (vinodkv: rev f103a70af5c5b01931b5cd2e5782eac5aeeb31cd)

        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Hdfs-trunk #2295 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2295/)
        HDFS-7314. Moving to 2.6.1 CHANGES.txt section. (vinodkv: rev f103a70af5c5b01931b5cd2e5782eac5aeeb31cd)

        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt

          People

          • Assignee:
            mingma Ming Ma
            Reporter:
            mingma Ming Ma
          • Votes:
            0
            Watchers:
            18

            Dates

            • Created:
              Updated:
              Resolved:

              Development