  1. Hadoop HDFS
  2. HDFS-11945

Internal lease recovery may not be retried for a long time

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.0.0-alpha4, 2.8.2
    • Component/s: namenode
    • Labels: None
    • Target Version/s:
    • Hadoop Flags: Reviewed

      Description

      A lease is assigned per client, which is identified by its holder ID (client ID). Renewing or expiring a lease therefore affects all files being written by that client.

      When a client/writer dies without closing a file, its lease expires after one hour (the hard limit) and the namenode tries to recover the lease. As part of this process, the namenode takes ownership of the lease and renews it. If the recovery does not finish successfully, the lease expires again after one hour and the namenode retries the recovery.

      However, if the file system has another lease expiring within that hour, the recovery attempt for that lease pushes forward the expiration of the lease held by the namenode. As a result, failed lease recoveries may not be retried for a long time. We have seen this go on for days.
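
      The starvation described above can be illustrated with a small model. This is only a sketch under simplifying assumptions, not the LeaseManager code: time is counted in minutes, the failed recovery starts at minute 0, and the class and method names (LeaseStarvationSketch, firstRetryMinute) are hypothetical. Each later expiry of some other client's lease renews the single namenode-held lease, so the retry keeps moving out.

```java
import java.util.Arrays;

class LeaseStarvationSketch {
    // Hard limit from the description: one hour, expressed in minutes.
    static final long HARD_LIMIT_MIN = 60;

    /**
     * Assumed model: a failed recovery starts at minute 0, so the
     * namenode-held lease is renewed at minute 0. Each entry in
     * otherExpiries is a minute (>= 0) at which another client's lease
     * hits the hard limit; recovering it renews the same namenode-held
     * lease. Returns the minute at which the namenode-held lease finally
     * expires, i.e. the earliest point the failed recovery is retried.
     */
    static long firstRetryMinute(long[] otherExpiries) {
        long[] sorted = otherExpiries.clone();
        Arrays.sort(sorted);
        long lastRenewal = 0;
        for (long t : sorted) {
            if (t - lastRenewal >= HARD_LIMIT_MIN) {
                break; // the namenode-held lease already expired before this event
            }
            lastRenewal = t; // another recovery renews the shared lease
        }
        return lastRenewal + HARD_LIMIT_MIN;
    }

    public static void main(String[] args) {
        // No other lease expires: retried after one hour.
        System.out.println(firstRetryMinute(new long[] {}));           // 60
        // Another lease expires every 30 minutes: retry slips to minute 150.
        System.out.println(firstRetryMinute(new long[] {30, 60, 90})); // 150
    }
}
```

      On a busy cluster where some lease expires at least once an hour, the loop in this model never breaks early, which matches the observation that retries can be delayed for days.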

      1. HDFS-11945.trunk.v2.patch (7 kB, Kihwal Lee)
      2. HDFS-11945.trunk.patch (7 kB, Kihwal Lee)
      3. HDFS-11945.branch-2.v2.patch (7 kB, Kihwal Lee)

        Activity

        kihwal Kihwal Lee added a comment -

        Thanks, Mingliang Liu!

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11846 (See https://builds.apache.org/job/Hadoop-trunk-Commit/11846/)
        HDFS-11945. Internal lease recovery may not be retried for a long time. (liuml07: rev 1a33c9d58927186c2f219a5ecb5f1573801823ad)

        • (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestFileTruncate.java
        • (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestLeaseRecovery2.java
        • (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/LeaseManager.java
        • (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestLeaseManager.java
        liuml07 Mingliang Liu added a comment -

        The failing tests are unrelated, and both pass on my local machine. Committed to trunk, branch-2, and branch-2.8. Thanks for your contribution, Kihwal Lee.

        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote | Subsystem | Runtime | Comment
        0 | reexec | 0m 15s | Docker mode activated.
        +1 | @author | 0m 0s | The patch does not contain any @author tags.
        +1 | test4tests | 0m 0s | The patch appears to include 3 new or modified test files.
        +1 | mvninstall | 13m 47s | trunk passed
        +1 | compile | 0m 49s | trunk passed
        +1 | checkstyle | 0m 36s | trunk passed
        +1 | mvnsite | 0m 53s | trunk passed
        +1 | findbugs | 1m 41s | trunk passed
        +1 | javadoc | 0m 39s | trunk passed
        +1 | mvninstall | 0m 48s | the patch passed
        +1 | compile | 0m 46s | the patch passed
        +1 | javac | 0m 46s | the patch passed
        +1 | checkstyle | 0m 35s | hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 75 unchanged - 2 fixed = 75 total (was 77)
        +1 | mvnsite | 0m 53s | the patch passed
        +1 | whitespace | 0m 0s | The patch has no whitespace issues.
        +1 | findbugs | 1m 51s | the patch passed
        +1 | javadoc | 0m 37s | the patch passed
        -1 | unit | 66m 1s | hadoop-hdfs in the patch failed.
        +1 | asflicense | 0m 19s | The patch does not generate ASF License warnings.
        | | 91m 48s |

        Reason | Tests
        Failed junit tests | hadoop.hdfs.server.namenode.ha.TestBootstrapStandbyWithQJM
                           | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting

        Subsystem | Report/Notes
        Docker | Image:yetus/hadoop:14b5c93
        JIRA Issue | HDFS-11945
        JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12872107/HDFS-11945.trunk.v2.patch
        Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname | Linux aba8a116f711 3.13.0-116-generic #163-Ubuntu SMP Fri Mar 31 14:13:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
        Build tool | maven
        Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision | trunk / a062374
        Default Java | 1.8.0_131
        findbugs | v3.1.0-RC1
        unit | https://builds.apache.org/job/PreCommit-HDFS-Build/19839/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
        Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/19839/testReport/
        modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
        Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/19839/console
        Powered by | Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        kihwal Kihwal Lee added a comment -

        Attaching updated patches. The only difference in branch-2 patch is one additional import.

        liuml07 Mingliang Liu added a comment -

        I'm +1 on the patch.

        Minor comments:

        1. The internalLeaseHolder value should be concatenated with _ instead of a space.
        2. The last test statement:
          assertFalse(holder.equals(lm.getInternalLeaseHolder()));
          

          Better to use:

          assertNotEquals("some meaningful message", holder, lm.getInternalLeaseHolder());
          
        kihwal Kihwal Lee added a comment -

        The failed tests all pass when I run them.

        -------------------------------------------------------
         T E S T S
        -------------------------------------------------------
        Running org.apache.hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFSStriped
        Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 88.762 sec
         - in org.apache.hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFSStriped
        Running org.apache.hadoop.hdfs.TestDFSStripedOutputStreamWithFailure010
        Tests run: 14, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 170.511 sec
         - in org.apache.hadoop.hdfs.TestDFSStripedOutputStreamWithFailure010
        Running org.apache.hadoop.hdfs.TestDFSStripedOutputStreamWithFailure080
        Tests run: 14, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 106.453 sec
         - in org.apache.hadoop.hdfs.TestDFSStripedOutputStreamWithFailure080
        
        Results :
        
        Tests run: 32, Failures: 0, Errors: 0, Skipped: 0
        
        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote | Subsystem | Runtime | Comment
        0 | reexec | 0m 15s | Docker mode activated.
        +1 | @author | 0m 0s | The patch does not contain any @author tags.
        +1 | test4tests | 0m 0s | The patch appears to include 3 new or modified test files.
        +1 | mvninstall | 13m 44s | trunk passed
        +1 | compile | 0m 48s | trunk passed
        +1 | checkstyle | 0m 35s | trunk passed
        +1 | mvnsite | 0m 52s | trunk passed
        +1 | findbugs | 1m 40s | trunk passed
        +1 | javadoc | 0m 40s | trunk passed
        +1 | mvninstall | 0m 48s | the patch passed
        +1 | compile | 0m 45s | the patch passed
        +1 | javac | 0m 45s | the patch passed
        +1 | checkstyle | 0m 33s | hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 75 unchanged - 2 fixed = 75 total (was 77)
        +1 | mvnsite | 0m 50s | the patch passed
        +1 | whitespace | 0m 0s | The patch has no whitespace issues.
        +1 | findbugs | 1m 44s | the patch passed
        +1 | javadoc | 0m 39s | the patch passed
        -1 | unit | 63m 31s | hadoop-hdfs in the patch failed.
        +1 | asflicense | 0m 20s | The patch does not generate ASF License warnings.
        | | 89m 2s |

        Reason | Tests
        Failed junit tests | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure080
                           | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure010
                           | hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFSStriped

        Subsystem | Report/Notes
        Docker | Image:yetus/hadoop:14b5c93
        JIRA Issue | HDFS-11945
        JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12871868/HDFS-11945.trunk.patch
        Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname | Linux 3a0e209ce470 3.13.0-107-generic #154-Ubuntu SMP Tue Dec 20 09:57:27 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
        Build tool | maven
        Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision | trunk / 24181f5
        Default Java | 1.8.0_131
        findbugs | v3.1.0-RC1
        unit | https://builds.apache.org/job/PreCommit-HDFS-Build/19826/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
        Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/19826/testReport/
        modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
        Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/19826/console
        Powered by | Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        kihwal Kihwal Lee added a comment - - edited

        We could change the namenode lease holder ID every hour. Normally, two IDs will be active in the system only for a brief moment. More than two can be active if there are failures. If the ID is suffixed with a time stamp or date string, the log message for recovery will show how old the leases are.
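
        A rough sketch of this idea, for illustration only and not the attached patch: the holder ID is regenerated once it is an hour old, and a UTC time stamp suffix makes the age of a lease visible in logs. The class name RotatingLeaseHolder, the prefix string, the "_" separator, and the time stamp format are all assumptions made for this example.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

class RotatingLeaseHolder {
    // Rotate the internal holder ID once per hour (matches the hard limit).
    static final long ROTATE_INTERVAL_MS = 60L * 60 * 1000;
    // Hypothetical prefix and time stamp format, chosen only for this sketch.
    private static final String PREFIX = "HDFS_NameNode";
    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyyMMdd'T'HHmmss").withZone(ZoneOffset.UTC);

    private String holder;
    private long lastRotationMs;

    RotatingLeaseHolder(long nowMs) {
        rotate(nowMs);
    }

    private void rotate(long nowMs) {
        // The suffix records when this holder ID was created.
        holder = PREFIX + "_" + FMT.format(Instant.ofEpochMilli(nowMs));
        lastRotationMs = nowMs;
    }

    // Returns the current holder ID, generating a new one if the old one
    // is at least ROTATE_INTERVAL_MS old.
    synchronized String getHolder(long nowMs) {
        if (nowMs - lastRotationMs >= ROTATE_INTERVAL_MS) {
            rotate(nowMs);
        }
        return holder;
    }

    public static void main(String[] args) {
        RotatingLeaseHolder h = new RotatingLeaseHolder(0L);
        System.out.println(h.getHolder(0L));         // HDFS_NameNode_19700101T000000
        System.out.println(h.getHolder(3_600_000L)); // HDFS_NameNode_19700101T010000
    }
}
```

        With a fresh ID per hour, a recovery started under an older ID is no longer renewed by later recoveries, so its lease can expire and be retried on schedule.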

        The major cause of lease recovery failures is datanodes having problems during block recoveries. One interesting case is when the namenode throws "server too busy" to datanodes. A commitBlockSynchronization() call can fail for this reason and won't be retried. HADOOP-14035 will mitigate this particular case.


          People

          • Assignee: Kihwal Lee (kihwal)
          • Reporter: Kihwal Lee (kihwal)
          • Votes: 0
          • Watchers: 11
