Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.6.0
    • Fix Version/s: 2.8.0, 3.0.0-alpha2
    • Component/s: hdfs-client
    • Labels:
      None

      Description

      Creating this ticket on behalf of Daryn Sharp

      We've seen this in one of our clusters. When a long-running process has a write failure, the lease is leaked and keeps getting renewed until the token expires.

        Issue Links

          Activity

          sjlee0 Sangjin Lee added a comment -

          Should this be targeted to 2.6.2? We're trying to release 2.6.1 soon. Let me know.

          sjlee0 Sangjin Lee added a comment -

          Unless the patch is ready to go and the JIRA is a critical fix, we'll defer it to 2.6.2. Let me know if you have comments. Thanks!

          daryn Daryn Sharp added a comment -

          A combination of errors occurred. Pipelines were frequently breaking because the cluster erroneously "thought" it was full. Mis-accounting bugs in the RBW reserved space and storage report contributed to the problem but almost full clusters will exhibit the same problems. A thread leaks and continues to renew the lease on a defunct file.

          Didn't seem like a big deal until we saw it in long running daemons. Then it was the NMs. Consider log aggregation pipelines breaking, NMs leaking dozens or hundreds of renewer threads, over thousands of nodes, NN has an insane number of open connections nearing your "this will never happen" fd limit, clogging it with worthless renewals. Now it gets good. The renewer threads won't abort until the token expires. Oh, you don't have security enabled? Better restart your NMs, hdfs proxies, oozies, DNs (webhdfs writes), hbase region servers, etc...

          I'm swamped and if you want to wait till 2.6.2, I'm ok.

          sjlee0 Sangjin Lee added a comment -

          Targeting 2.6.3 now that 2.6.2 has shipped.

          vinodkv Vinod Kumar Vavilapalli added a comment -

          Moving out all non-critical / non-blocker issues that didn't make it out of 2.7.2 into 2.7.3.

          djp Junping Du added a comment -

          Hi, can we move this out of 2.6.3? Thanks!

          djp Junping Du added a comment -

          Moving it to 2.6.4 as the JIRA hasn't been updated for a while.

          djp Junping Du added a comment -

          Hi Daryn Sharp, is this issue related to HDFS-9294? Thanks!

          djp Junping Du added a comment -

          Moving it to 2.6.5 as the JIRA hasn't been updated for a while.

          vinodkv Vinod Kumar Vavilapalli added a comment -

          2.7.3 is under release process, changing target-version to 2.7.4.

          ctrezzo Chris Trezzo added a comment -

          Moving this issue to 2.6.6. Please move back if you feel otherwise.

          kshukla Kuhu Shukla added a comment -

          Initial patch that moves endFileLease to setClosed(). Includes a simple test for this move.
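
The effect of that move can be illustrated with a small, self-contained model. This is only a sketch, not the actual Hadoop code: `LeaseLeakSketch`, `LeaseTable`, `SketchOutputStream`, and the `releaseInSetClosed` and `failOnFlush` flags are all hypothetical names made up for illustration. It assumes the leak happens because the lease is released on the normal close path only, so a write failure skips the release, while `setClosed()` runs on every path.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class LeaseLeakSketch {

    /** Hypothetical stand-in for the client-side set of files whose leases get renewed. */
    static class LeaseTable {
        private final Set<Long> filesBeingWritten = new HashSet<>();

        void beginFileLease(long fileId) { filesBeingWritten.add(fileId); }

        void endFileLease(long fileId) { filesBeingWritten.remove(fileId); }

        /** In this model, a renewer thread keeps running while this returns true. */
        boolean hasWorkForRenewer() { return !filesBeingWritten.isEmpty(); }
    }

    /** Hypothetical output stream; only the lease bookkeeping is modelled. */
    static class SketchOutputStream {
        private final LeaseTable leases;
        private final long fileId;
        private final boolean releaseInSetClosed;
        private final boolean failOnFlush;
        private boolean closed;

        SketchOutputStream(LeaseTable leases, long fileId,
                           boolean releaseInSetClosed, boolean failOnFlush) {
            this.leases = leases;
            this.fileId = fileId;
            this.releaseInSetClosed = releaseInSetClosed;
            this.failOnFlush = failOnFlush;
            leases.beginFileLease(fileId);
        }

        void close() throws IOException {
            if (closed) {
                return;
            }
            try {
                flush(); // may throw, simulating a broken write pipeline
                if (!releaseInSetClosed) {
                    // Old placement: never reached when flush() throws.
                    leases.endFileLease(fileId);
                }
            } finally {
                setClosed(); // reached on success and on failure
            }
        }

        private void flush() throws IOException {
            if (failOnFlush) {
                throw new IOException("simulated pipeline failure");
            }
        }

        private void setClosed() {
            closed = true;
            if (releaseInSetClosed) {
                // New placement: the lease ends whenever the stream is marked closed.
                leases.endFileLease(fileId);
            }
        }
    }

    /** Attempts one failing close() and reports whether the lease is still registered. */
    static boolean leaksAfterFailedClose(boolean releaseInSetClosed) {
        LeaseTable leases = new LeaseTable();
        SketchOutputStream out =
            new SketchOutputStream(leases, 1L, releaseInSetClosed, true);
        try {
            out.close();
        } catch (IOException expected) {
            // The write failure itself is expected; only the lease state matters here.
        }
        return leases.hasWorkForRenewer();
    }

    public static void main(String[] args) {
        System.out.println("old placement leaks: " + leaksAfterFailedClose(false));
        System.out.println("setClosed placement leaks: " + leaksAfterFailedClose(true));
    }
}
```

In this model the old placement leaves the file registered after a failed close, which is the condition that would keep a renewer busy, while releasing the lease in `setClosed()` clears it on both the success and failure paths.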

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote  Subsystem   Runtime  Comment
          0     reexec      0m 11s   Docker mode activated.
          +1    @author     0m 0s    The patch does not contain any @author tags.
          +1    test4tests  0m 0s    The patch appears to include 1 new or modified test files.
          0     mvndep      0m 28s   Maven dependency ordering for branch
          +1    mvninstall  7m 44s   trunk passed
          +1    compile     1m 22s   trunk passed
          +1    checkstyle  0m 31s   trunk passed
          +1    mvnsite     1m 29s   trunk passed
          +1    mvneclipse  0m 25s   trunk passed
          +1    findbugs    3m 16s   trunk passed
          +1    javadoc     1m 1s    trunk passed
          0     mvndep      0m 7s    Maven dependency ordering for patch
          +1    mvninstall  1m 16s   the patch passed
          +1    compile     1m 23s   the patch passed
          +1    javac       1m 23s   the patch passed
          +1    checkstyle  0m 29s   the patch passed
          +1    mvnsite     1m 24s   the patch passed
          +1    mvneclipse  0m 19s   the patch passed
          +1    whitespace  0m 0s    The patch has no whitespace issues.
          +1    findbugs    3m 20s   the patch passed
          +1    javadoc     0m 54s   the patch passed
          +1    unit        0m 56s   hadoop-hdfs-client in the patch passed.
          -1    unit        60m 26s  hadoop-hdfs in the patch failed.
          +1    asflicense  0m 20s   The patch does not generate ASF License warnings.
                            88m 39s



          Reason              Tests
          Failed junit tests  hadoop.security.TestPermission
                              hadoop.hdfs.server.datanode.TestDataNodeRollingUpgrade



          Subsystem       Report/Notes
          Docker          Image:yetus/hadoop:9560f25
          JIRA Issue      HDFS-8870
          JIRA Patch URL  https://issues.apache.org/jira/secure/attachment/12835624/HDFS-8870.001.patch
          Optional Tests  asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname           Linux eccac35548ca 3.13.0-93-generic #140-Ubuntu SMP Mon Jul 18 21:21:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool      maven
          Personality     /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision    trunk / ac35ee9
          Default Java    1.8.0_101
          findbugs        v3.0.0
          unit            https://builds.apache.org/job/PreCommit-HDFS-Build/17330/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results    https://builds.apache.org/job/PreCommit-HDFS-Build/17330/testReport/
          modules         C: hadoop-hdfs-project/hadoop-hdfs-client hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project
          Console output  https://builds.apache.org/job/PreCommit-HDFS-Build/17330/console
          Powered by      Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          kshukla Kuhu Shukla added a comment -

          Test failures are unrelated. TestPermission is broken by HDFS-10455. TestDataNodeRollingUpgrade is passing locally and the error in the precommit is a bind exception which is not related to this fix.

          Requesting Kihwal Lee, Daryn Sharp and others for more reviews/comments and suggestions. Thanks a lot!

          kihwal Kihwal Lee added a comment -

          +1 The patch looks good.

          kihwal Kihwal Lee added a comment - - edited

          Committed this to trunk, branch-2 and branch-2.8.
          Do you have a patch for 2.7 and 2.6?

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10840 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10840/)
          HDFS-8870. Lease is leaked on write failure. Contributed by Kuhu Shukla. (kihwal: rev 4fcea8a0c8019d6d9a5e6f315c83659938b93a40)

          • (edit) hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java
          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSOutputStream.java
          andrew.wang Andrew Wang added a comment -

          I'm going to resolve this issue for now, please reopen if you want to backport to 2.7 and 2.6 for precommit runs.


            People

            • Assignee:
              kshukla Kuhu Shukla
              Reporter:
              shahrs87 Rushabh S Shah
            • Votes:
              0
              Watchers:
              17

              Dates

              • Created:
                Updated:
                Resolved:
