  Hadoop HDFS / HDFS-10220

A large number of expired leases can make namenode unresponsive and cause failover

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: namenode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Two new configuration keys have been added, "dfs.namenode.lease-recheck-interval-ms" and "dfs.namenode.max-lock-hold-to-release-lease-ms", to fine-tune the duty cycle with which the NameNode recovers old leases.

      Description

      I have faced a NameNode failover due to an unresponsive NameNode detected by the ZKFC, with lots of WARN messages (5 million) like this one:
      org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All existing blocks are COMPLETE, lease removed, file closed.

      On the thread dump taken by the ZKFC there are lots of threads blocked on a lock.

      Looking at the code, there is a lock taken by the LeaseManager.Monitor when some leases must be released. Due to the very large number of leases to be released, the NameNode took too long to release them, blocking all other tasks and making the ZKFC think that the NameNode was unavailable/stuck.

      The idea of this patch is to limit the number of leases released each time we check for expired leases, so the lock won't be held for too long a period.
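
      As a rough illustration of the idea (a minimal, self-contained sketch with hypothetical names, not the actual patch):

        import java.util.ArrayDeque;
        import java.util.Queue;

        /** Sketch: cap how many expired leases one check may release, so the
         *  global lock guarding the lease structures is given back
         *  periodically instead of being held for millions of releases. */
        class ThrottledLeaseReleaseSketch {
          interface Lease {
            boolean expiredHardLimit();
            void release();
          }

          private final Queue<Lease> sortedLeases = new ArrayDeque<>();
          private final int maxLeasesPerCheck = 100000; // illustrative cap

          synchronized boolean checkLeases() {
            int released = 0;
            while (!sortedLeases.isEmpty()
                && sortedLeases.peek().expiredHardLimit()
                && released < maxLeasesPerCheck) {
              sortedLeases.poll().release();
              released++;
            }
            return released > 0; // leftovers are picked up on the next check
          }
        }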

      1. HADOOP-10220.001.patch
        9 kB
        Nicolas Fraison
      2. HADOOP-10220.002.patch
        10 kB
        Nicolas Fraison
      3. HADOOP-10220.003.patch
        11 kB
        Nicolas Fraison
      4. HADOOP-10220.004.patch
        11 kB
        Nicolas Fraison
      5. HADOOP-10220.005.patch
        8 kB
        Nicolas Fraison
      6. HADOOP-10220.006.patch
        8 kB
        Nicolas Fraison
      7. HADOOP-10220.007.patch
        13 kB
        Nicolas Fraison
      8. threaddump_zkfc.txt
        809 kB
        Nicolas Fraison

        Activity

        raviprak Ravi Prakash added a comment -

        Thank you for the report Nicolas! Which version of Hadoop did you experience this on?

        nfraison.criteo Nicolas Fraison added a comment -

        On Hadoop 2.6.0, from the CDH 5.5.0 package.

        raviprak Ravi Prakash added a comment -

        Thanks Nicolas! I am going to resolve this as a duplicate of https://issues.apache.org/jira/browse/HDFS-4882 in that case. It was committed into Hadoop 2.6.1. Please feel free to re-open this JIRA if I somehow misunderstood your problem. Thanks!

        nfraison.criteo Nicolas Fraison added a comment -

        Hi Ravi,

        Thanks for the feedback, but for me the issue we face is not the same as the one indicated in HDFS-4882.
        In fact that patch has been applied in the Hadoop 2.6.0 package since CDH 5.4.0 (sorry, but Cloudera backports some patches from different versions of Hadoop, so I can't give you the exact Hadoop release), and from the source code I can confirm that it is applied in CDH 5.5.0 as well: https://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.4.0.releasenotes.html.

        Nicolas

        vinayrpet Vinayakumar B added a comment -

        Nicolas Fraison, did you see similar messages (maybe < 5 million) even after the failover?
        If there were so many unclosed files, and a failover happened in between, then the remaining files might also have been closed after the failover.

        If that is the case (having millions of unclosed files from dead clients within an hour), your proposal to limit the number of leases released each time makes sense.

        raviprak Ravi Prakash added a comment -

        Oh! I'm sorry in that case. Thank you for re-opening the JIRA. Just out of curiosity, how many files did you have open at one time? Do you think we should cycle through all the leases rather than the same set every iteration? We may be over-engineering, but I'd be interested in your opinion. I'll review your patch shortly.

        raviprak Ravi Prakash added a comment -
        import static org.apache.hadoop.hdfs.DFSConfigKeys.*;

        Could you please import classes explicitly?

          /** Number max of path for released lease each time Monitor check for expired lease */
          private final long maxPathRealeaseExpiredLease;
        

        has grammar and spelling errors. I'd suggest

          /** Maximum number of files whose lease will be released in one iteration of checkLeases() */
          private final long maxPathReleaseExpiredLease; // <-- Release was misspelt here
        
            Configuration conf = new Configuration();
            this.maxPathRealeaseExpiredLease = conf.getLong(DFS_NAMENODE_MAX_PATH_RELEASE_EXPIRED_LEASE_KEY,
              DFS_NAMENODE_MAX_PATH_RELEASE_EXPIRED_LEASE_DEFAULT);
        

        I'm fine with not getting maxPathRealeaseExpiredLease from configuration and hardcoding it to your default value of 100000. If you want to keep the configuration, I'd suggest changing dfs.namenode.max-path-release-expired-lease to dfs.namenode.lease-manager.max-released-leases-per-iteration .

        Please rename nPathReleaseExpiredLease to numLeasesReleased

        // Stop releasing lease as a lock is hold after few iterations

        change to

         //Relinquish FSNamesystem lock after maxPathRealeaseExpiredLease iterations 
                LOG.warn("Breaking out of checkLeases() after " + nPathReleaseExpiredLease + " iterations",
                  new Throwable("Too long loop with a lock"));
        

        It's unnecessary to log an exception.

        For the purposes of testing, you can add a method which changes maxPathRealeaseExpiredLease and annotate it. Could you please also rename testCheckLeaseNotInfiniteLoop and change its documentation?

        raviprak Ravi Prakash added a comment -

        Also, let's rename maxPathReleaseExpiredLease to maxFilesCheckedForRelease (because completed may be false in some cases).

        nfraison.criteo Nicolas Fraison added a comment -

        Vinayakumar B, the day we faced this kind of failover we faced multiple failovers with the same issue on both NameNodes. It happened after a bad action on the cluster removed the mapreduce.jobhistory.intermediate-done-dir folder, after which lots of MapReduce jobs failed...

        Since we applied this patch we have once reached 250K leases to release, taking a total time of 45 seconds (with 100K leases treated per cycle).

        nfraison.criteo Nicolas Fraison added a comment -

        Ravi Prakash, this issue seems to only happen when, for various reasons, we have a big MR job or a large number of MapReduce jobs failing abnormally (so not closing their leases properly).
        It is quite difficult to tell you the number of open files, as there are lots of different jobs running on the platform (up to 100K per day) with different behaviours. Lots of them are distcp jobs, with a max of 300 files open for write by each distcp. So we should have a maximum of 20-40K files open for write at a time.

        Do you think we should cycle through all the leases rather than the same set every iteration? We may be over-engineering, but I'd be interested in your opinion
        In fact the patch is quite simple and will cycle over the same set. I think it would be better to cycle through all expired leases over multiple iterations, unless we consider that checking 100K leases is a safe value and that we won't keep cycling over the same set since some will be released.

        Please rename nPathReleaseExpiredLease to numLeasesReleased
        I will rename it numLeasesPathsReleased, as it is the number of paths treated in the loop releasing leases.

        nfraison.criteo Nicolas Fraison added a comment -

        Here is the second version of the patch, taking your comments into account.

        Let's then see whether we would prefer not to iterate over the same set of leases.

        vinayrpet Vinayakumar B added a comment -
        +  public static final String  DFS_NAMENODE_MAX_RELEASED_LEASES_PER_ITERATION_KEY = "dfs.namenode.lease-manager.max-released-leases-per-iteration";
        +  public static final long    DFS_NAMENODE_MAX_RELEASED_LEASES_PER_ITERATION_DEFAULT = 100000;
        

        I think the prop name could be "dfs.namenode.lease-manager.max-files-checked-per-iteration", as all might not be released immediately; for some, block recovery will be triggered.
        And I think the default value of 100000 is quite high. Since this is the NameNode explicitly recovering unclosed files from dead clients, there is no need to hurry.
        I feel 1000 would be enough, similar to BLOCK_DELETION_INCREMENT for the removal of a large number of blocks.

        +  LeaseManager(FSNamesystem fsnamesystem) {
        +    this.fsnamesystem = fsnamesystem;
        +    Configuration conf = new Configuration();
        +    this.maxFilesCheckedForRelease = conf.getLong(DFS_NAMENODE_MAX_RELEASED_LEASES_PER_ITERATION_KEY,
        +      DFS_NAMENODE_MAX_RELEASED_LEASES_PER_ITERATION_DEFAULT);
        +  }
        

        Creating a new Configuration instance will always load configurations from the XML files. This would make it impossible to change values on the fly for tests. It would be better to pass the Configuration from FSNamesystem during creation of the LeaseManager.

        +  @VisibleForTesting
        +  public void setMaxFilesCheckedForRelease(long maxFilesCheckedForRelease) {
        +    this.maxFilesCheckedForRelease = maxFilesCheckedForRelease;
        +  }

        When the conf is passed from FSNamesystem, this may not be required. The test can change the configuration itself.
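
        For illustration only, the suggested constructor change might look like this (a sketch of the suggestion, not the committed code):

        +  LeaseManager(FSNamesystem fsnamesystem, Configuration conf) {
        +    this.fsnamesystem = fsnamesystem;
        +    // Use the caller's Configuration so tests can tune the value before
        +    // the LeaseManager is created, instead of loading the XML defaults
        +    // with new Configuration().
        +    this.maxFilesCheckedForRelease = conf.getLong(
        +        DFS_NAMENODE_MAX_RELEASED_LEASES_PER_ITERATION_KEY,
        +        DFS_NAMENODE_MAX_RELEASED_LEASES_PER_ITERATION_DEFAULT);
        +  }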

        +      long numLeasesPathsReleased = 0;
        .
        .
        .
        +      numLeasesPathsReleased += numLeasePaths;
        +      //Relinquish FSNamesystem lock after maxPathRealeaseExpiredLease iterations
        +      if (numLeasesPathsReleased >= maxFilesCheckedForRelease) {
        +        LOG.warn("Breaking out of checkLeases() after " + numLeasesPathsReleased + " iterations");
        +        break;
        +      }
        

        This will not ensure that the loop comes out right after the max number of leases has been released. If one client itself has a larger number of open files, then this may not be correct.
        It can be modified as below.

        @@ -356,7 +363,9 @@ synchronized boolean checkLeases() {
             boolean needSync = false;
             assert fsnamesystem.hasWriteLock();
         
        -    while(!sortedLeases.isEmpty() && sortedLeases.peek().expiredHardLimit()) {
        +    int filesLeasesChecked = 0;
        +    while (!sortedLeases.isEmpty() && sortedLeases.peek().expiredHardLimit()
        +        && filesLeasesChecked < maxFilesCheckedForLeaseRecovery) {
               Lease leaseToCheck = sortedLeases.poll();
               LOG.info(leaseToCheck + " has expired hard limit");
         
        @@ -396,6 +405,14 @@ synchronized boolean checkLeases() {
                   LOG.error("Cannot release the path " + p + " in the lease "
                       + leaseToCheck, e);
                   removing.add(id);
        +        } finally{
        +          filesLeasesChecked++;
        +          if (filesLeasesChecked >= maxFilesCheckedForLeaseRecovery) {
        +            LOG.warn(
        +                "Breaking out of checkLeases() after " + filesLeasesChecked
        +                    + " file leases checked.");
        +            break;
        +          }
                 }
               }
        
        nfraison.criteo Nicolas Fraison added a comment -

        Thanks for the comments, Vinayakumar B. Here is the new patch taking your comments into account.
        Ravi Prakash, there is one mistake in my previous comment: we must consider whether 1K files checked is sufficiently safe for not iterating over the same leases (not 1K leases).

        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 20s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
        +1 mvninstall 9m 41s trunk passed
        +1 compile 1m 15s trunk passed with JDK v1.8.0_77
        +1 compile 1m 3s trunk passed with JDK v1.7.0_95
        +1 checkstyle 0m 39s trunk passed
        +1 mvnsite 1m 13s trunk passed
        +1 mvneclipse 0m 21s trunk passed
        +1 findbugs 2m 29s trunk passed
        +1 javadoc 1m 40s trunk passed with JDK v1.8.0_77
        +1 javadoc 2m 40s trunk passed with JDK v1.7.0_95
        +1 mvninstall 1m 7s the patch passed
        +1 compile 1m 8s the patch passed with JDK v1.8.0_77
        +1 javac 1m 8s the patch passed
        +1 compile 0m 55s the patch passed with JDK v1.7.0_95
        +1 javac 0m 55s the patch passed
        -1 checkstyle 0m 37s hadoop-hdfs-project/hadoop-hdfs: patch generated 9 new + 603 unchanged - 2 fixed = 612 total (was 605)
        +1 mvnsite 1m 8s the patch passed
        +1 mvneclipse 0m 15s the patch passed
        -1 whitespace 0m 0s The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix.
        +1 xml 0m 2s The patch has no ill-formed XML file.
        +1 findbugs 2m 38s the patch passed
        +1 javadoc 1m 37s the patch passed with JDK v1.8.0_77
        +1 javadoc 2m 34s the patch passed with JDK v1.7.0_95
        -1 unit 75m 57s hadoop-hdfs in the patch failed with JDK v1.8.0_77.
        -1 unit 0m 26s hadoop-hdfs in the patch failed with JDK v1.7.0_95.
        -1 asflicense 0m 29s Patch generated 1 ASF License warnings.
        113m 53s



        Reason Tests
        JDK v1.8.0_77 Failed junit tests hadoop.hdfs.server.datanode.TestDirectoryScanner
          hadoop.hdfs.server.mover.TestStorageMover
          hadoop.hdfs.server.namenode.ha.TestEditLogTailer
          hadoop.hdfs.server.datanode.TestDataNodeUUID
          hadoop.hdfs.security.TestDelegationTokenForProxyUser
          hadoop.hdfs.TestFileAppend
        JDK v1.8.0_77 Timed out junit tests org.apache.hadoop.hdfs.TestFileCreationClient
          org.apache.hadoop.hdfs.server.balancer.TestBalancer
          org.apache.hadoop.hdfs.TestReadStripedFileWithDecoding
          org.apache.hadoop.hdfs.server.datanode.TestDirectoryScanner



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:fbe3e86
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12796655/HADOOP-10220.003.patch
        JIRA Issue HDFS-10220
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle xml
        uname Linux e84713098b43 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 1ff27f9
        Default Java 1.7.0_95
        Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_77 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
        findbugs v3.0.0
        checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/15131/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt
        whitespace https://builds.apache.org/job/PreCommit-HDFS-Build/15131/artifact/patchprocess/whitespace-eol.txt
        unit https://builds.apache.org/job/PreCommit-HDFS-Build/15131/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_77.txt
        unit https://builds.apache.org/job/PreCommit-HDFS-Build/15131/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
        unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/15131/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_77.txt
        JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/15131/testReport/
        asflicense https://builds.apache.org/job/PreCommit-HDFS-Build/15131/artifact/patchprocess/patch-asflicense-problems.txt
        modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
        Console output https://builds.apache.org/job/PreCommit-HDFS-Build/15131/console
        Powered by Apache Yetus 0.2.0 http://yetus.apache.org

        This message was automatically generated.

        vinayrpet Vinayakumar B added a comment -

        Thanks @Nicolas Fraison for the patch.
        Looks almost good. Some comments below.

        1. "DFS_NAMENODE_MAX_FILES_CHECKED_PER_ITERATION_KEY" could be "DFS_NAMENODE_LEASE_MAX_FILES_CHECKED_PER_ITERATION_KEY", same for DEFAULT as well.
        2.

        	  /** Maximum number of files whose lease will be released in one iteration
        	   *  of LeaseManager.checkLeases()
        	   */
        	  private final int maxFilesLeasesCheckedForRelease;
        

        The comment could be corrected to say 'number of files checked to release lease in one iteration'.
        3.

        	          if (filesLeasesChecked >= fsnamesystem.getMaxFilesLeasesCheckedForRelease()) {
        	            removeFilesInLease(leaseToCheck, removing);
        	            LOG.warn("Breaking out of checkLeases() after " + filesLeasesChecked
        	              + "  file leases checked.");
        	            break;
        	          }
        

        Here, removeFilesInLease(leaseToCheck, removing) may not be required, as just breaking out of the for-loop will do the removal outside the while-loop. So extracting it to a separate method may not be necessary either.

        4.

        -      Long[] leaseINodeIds = files.toArray(new Long[files.size()]);
        +      final int numLeasePaths = files.size();
        +      Long[] leaseINodeIds = files.toArray(new Long[numLeasePaths]);
        

        This also may not be required.

        5. checkstyle comments can be addressed.

        +1, once these are addressed.

        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 15s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
        +1 mvninstall 7m 48s trunk passed
        +1 compile 0m 59s trunk passed with JDK v1.8.0_77
        +1 compile 0m 48s trunk passed with JDK v1.7.0_95
        +1 checkstyle 0m 27s trunk passed
        +1 mvnsite 0m 57s trunk passed
        +1 mvneclipse 0m 14s trunk passed
        +1 findbugs 2m 4s trunk passed
        +1 javadoc 1m 20s trunk passed with JDK v1.8.0_77
        +1 javadoc 2m 1s trunk passed with JDK v1.7.0_95
        +1 mvninstall 0m 53s the patch passed
        +1 compile 0m 53s the patch passed with JDK v1.8.0_77
        +1 javac 0m 53s the patch passed
        +1 compile 0m 48s the patch passed with JDK v1.7.0_95
        +1 javac 0m 48s the patch passed
        -1 checkstyle 0m 27s hadoop-hdfs-project/hadoop-hdfs: patch generated 1 new + 601 unchanged - 4 fixed = 602 total (was 605)
        +1 mvnsite 0m 57s the patch passed
        +1 mvneclipse 0m 11s the patch passed
        +1 whitespace 0m 0s Patch has no whitespace issues.
        +1 xml 0m 1s The patch has no ill-formed XML file.
        +1 findbugs 2m 19s the patch passed
        +1 javadoc 1m 18s the patch passed with JDK v1.8.0_77
        +1 javadoc 2m 2s the patch passed with JDK v1.7.0_95
        -1 unit 107m 30s hadoop-hdfs in the patch failed with JDK v1.8.0_77.
        -1 unit 98m 50s hadoop-hdfs in the patch failed with JDK v1.7.0_95.
        +1 asflicense 0m 26s Patch does not generate ASF License warnings.
        236m 9s



        Reason Tests
        JDK v1.8.0_77 Failed junit tests hadoop.hdfs.server.datanode.TestDataNodeUUID
          hadoop.hdfs.server.namenode.TestDiskspaceQuotaUpdate
          hadoop.hdfs.server.namenode.TestNestedEncryptionZones
          hadoop.hdfs.TestDFSStripedOutputStreamWithFailure
          hadoop.hdfs.TestReadStripedFileWithMissingBlocks
          hadoop.hdfs.server.namenode.snapshot.TestSnapshotDeletion
          hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl
          hadoop.hdfs.TestRenameWhileOpen
          hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead
          hadoop.hdfs.security.TestDelegationTokenForProxyUser
          hadoop.hdfs.server.namenode.TestSecondaryNameNodeUpgrade
          hadoop.hdfs.server.namenode.metrics.TestNameNodeMetrics
          hadoop.hdfs.server.namenode.ha.TestHAAppend
          hadoop.hdfs.server.namenode.TestNamenodeCapacityReport
          hadoop.hdfs.server.namenode.TestAddOverReplicatedStripedBlocks
          hadoop.hdfs.server.namenode.TestNameNodeRespectsBindHostKeys
          hadoop.hdfs.server.datanode.TestDirectoryScanner
        JDK v1.8.0_77 Timed out junit tests org.apache.hadoop.hdfs.TestWriteReadStripedFile
          org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing
          org.apache.hadoop.hdfs.TestReadStripedFileWithDecoding
        JDK v1.7.0_95 Failed junit tests hadoop.hdfs.TestDFSStripedOutputStreamWithFailure
          hadoop.hdfs.server.namenode.TestNameNodeMetadataConsistency
          hadoop.hdfs.TestReadStripedFileWithMissingBlocks
          hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots
          hadoop.hdfs.TestModTime
          hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl
          hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead
          hadoop.hdfs.server.blockmanagement.TestBlocksWithNotEnoughRacks
          hadoop.hdfs.server.datanode.TestDirectoryScanner
        JDK v1.7.0_95 Timed out junit tests org.apache.hadoop.hdfs.TestWriteReadStripedFile
          org.apache.hadoop.hdfs.TestReadStripedFileWithDecoding



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:fbe3e86
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12798961/HADOOP-10220.004.patch
        JIRA Issue HDFS-10220
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle xml
        uname Linux f458e5d8514f 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 6e6b6dd
        Default Java 1.7.0_95
        Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_77 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
        findbugs v3.0.0
        checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/15170/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt
        unit https://builds.apache.org/job/PreCommit-HDFS-Build/15170/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_77.txt
        unit https://builds.apache.org/job/PreCommit-HDFS-Build/15170/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
        unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/15170/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_77.txt https://builds.apache.org/job/PreCommit-HDFS-Build/15170/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
        JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/15170/testReport/
        modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
        Console output https://builds.apache.org/job/PreCommit-HDFS-Build/15170/console
        Powered by Apache Yetus 0.2.0 http://yetus.apache.org

        This message was automatically generated.

        nfraison.criteo Nicolas Fraison added a comment -

        Vinayakumar B, there is one checkstyle issue remaining, but it is on a line that is not changed by the patch, and the comments have been addressed.
        Let me know if the patch can be merged.

        walter.k.su Walter Su added a comment -

        1. isMaxFilesCheckedToReleaseLease is not required to be a function.
        2. To repeat what Vinayakumar B said, removeFilesInLease(leaseToCheck, removing) may not be required.
        3. The LOG.warn("..") is kind of verbose.
        4. I think the config should be kept internal. It's an implementation detail. The re-check interval is 2s and is hard-coded too. Besides, it's too complicated for the user to pick the right value. Instead of counting the files, I prefer counting the time: if it holds the lock for too long, log a warning and break out for a while.
        5. btw, HDFS-9311 should solve this issue.

        nfraison.criteo Nicolas Fraison added a comment -

        Walter Su
        1. I think that it simplifies the code.
        2. OK, I will remove this change, as we now only call it once since it is removed from the lease-loop break.
        3. OK, I will change it to DEBUG.
        4. Counting the time seems better in terms of functionality, but I'm afraid of adding extra computation time to this check compared to a simple count of files. The idea is not to spend more time releasing those leases. What is your feeling about it?
        5. It should solve the failover issue, but it will still let the lock be held during the release of leases, slowing down/stalling operations on the NameNode.

        raviprak Ravi Prakash added a comment -

        It's an implementation detail

        I've heard that argument before and I honestly don't know what that means. We have a lot of "implementation details" which are tunable via configurations. I don't see anything bad about that. The user is not expected to know anything about those configurations, cluster administrators are! Without these configurations, cluster admins have no knobs to turn to fix issues they see.

        raviprak Ravi Prakash added a comment -

        Anyway! I am not going to be a stickler on this. In the interest of getting the patch committed I'm fine with either way.

        walter.k.su Walter Su added a comment -

        You are right. The only question I have is that I have no idea whether the default value of 1000 is the right choice, or whether throttling the rate is the right approach. I kind of hope it works out of the box. Small companies with small clusters have cluster administrators who may not quite understand what the configuration means.

        Counting the time seems better in terms of functionality, but I'm afraid of adding extra computation time to this check compared to a simple count of files. The idea is not to spend more time releasing those leases. What is your feeling about it?

        I believe the overhead can be ignored. Or we can calculate the elapsed time after processing a small batch.

        I saw that BlockManager.BlockReportProcessingThread releases the writeLock if it holds it for more than 4ms. Do you think the same idea works here?
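
        A minimal sketch of the time-based check being discussed (the 4ms budget here mirrors BlockReportProcessingThread and is illustrative, not necessarily what the patch ended up with; Hadoop code would typically use Time.monotonicNow() rather than System.nanoTime()):

          // Illustrative: give checkLeases() a fixed lock-hold budget and break
          // out once it is exceeded, instead of counting files.
          private static final long MAX_LOCK_HOLD_TO_RELEASE_LEASE_MS = 4;

          /** True once the current pass has held the lock longer than the budget. */
          private boolean isMaxLockHoldToReleaseLease(long startNanos) {
            long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000L;
            return elapsedMs > MAX_LOCK_HOLD_TO_RELEASE_LEASE_MS;
          }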

        walter.k.su Walter Su added a comment -

        I mean, saving administrators the trouble of tuning this.

        raviprak Ravi Prakash added a comment -

        Thanks Walter! I too would favor a timeout (maybe checking after small batches of 1K files).

        Thanks Nicolas Fraison! Could you please upload a patch addressing Walter's comments that you agreed to 4 days ago? I will leave it up to you whether to incorporate the timeout in this patch or not. If not, we can file a follow-on JIRA. Let's get this in.

        nfraison.criteo Nicolas Fraison added a comment -

        Here is a new version of the patch taking the comments into account.
        Thanks Walter Su for pointing me to the BlockReportProcessingThread way of checking how long the lock has been held. It clearly doesn't add too much extra computing time.

        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 22s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
        0 mvndep 0m 22s Maven dependency ordering for branch
        +1 mvninstall 7m 6s trunk passed
        +1 compile 1m 20s trunk passed with JDK v1.8.0_92
        +1 compile 1m 22s trunk passed with JDK v1.7.0_95
        +1 checkstyle 0m 30s trunk passed
        +1 mvnsite 1m 32s trunk passed
        +1 mvneclipse 0m 30s trunk passed
        +1 findbugs 3m 39s trunk passed
        +1 javadoc 1m 29s trunk passed with JDK v1.8.0_92
        +1 javadoc 2m 13s trunk passed with JDK v1.7.0_95
        0 mvndep 0m 9s Maven dependency ordering for patch
        +1 mvninstall 1m 20s the patch passed
        +1 compile 1m 15s the patch passed with JDK v1.8.0_92
        +1 javac 1m 15s the patch passed
        +1 compile 1m 21s the patch passed with JDK v1.7.0_95
        +1 javac 1m 21s the patch passed
        +1 checkstyle 0m 27s hadoop-hdfs-project: patch generated 0 new + 211 unchanged - 5 fixed = 211 total (was 216)
        +1 mvnsite 1m 24s the patch passed
        +1 mvneclipse 0m 22s the patch passed
        +1 whitespace 0m 0s Patch has no whitespace issues.
        +1 findbugs 4m 5s the patch passed
        +1 javadoc 1m 23s the patch passed with JDK v1.8.0_92
        +1 javadoc 2m 8s the patch passed with JDK v1.7.0_95
        +1 unit 0m 53s hadoop-hdfs-client in the patch passed with JDK v1.8.0_92.
        -1 unit 69m 41s hadoop-hdfs in the patch failed with JDK v1.8.0_92.
        +1 unit 0m 59s hadoop-hdfs-client in the patch passed with JDK v1.7.0_95.
        -1 unit 74m 6s hadoop-hdfs in the patch failed with JDK v1.7.0_95.
        +1 asflicense 0m 22s Patch does not generate ASF License warnings.
        182m 45s



        Reason Tests
        JDK v1.8.0_92 Failed junit tests hadoop.hdfs.TestHFlush
          hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl
        JDK v1.7.0_95 Failed junit tests hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:fbe3e86
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12800710/HADOOP-10220.005.patch
        JIRA Issue HDFS-10220
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 35daba647b3c 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / a5fed8b
        Default Java 1.7.0_95
        Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_92 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
        findbugs v3.0.0
        unit https://builds.apache.org/job/PreCommit-HDFS-Build/15288/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_92.txt
        unit https://builds.apache.org/job/PreCommit-HDFS-Build/15288/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
        unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/15288/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_92.txt https://builds.apache.org/job/PreCommit-HDFS-Build/15288/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
        JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/15288/testReport/
        modules C: hadoop-hdfs-project/hadoop-hdfs-client hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project
        Console output https://builds.apache.org/job/PreCommit-HDFS-Build/15288/console
        Powered by Apache Yetus 0.2.0 http://yetus.apache.org

        This message was automatically generated.

        Show
        hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 22s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 1 new or modified test files. 0 mvndep 0m 22s Maven dependency ordering for branch +1 mvninstall 7m 6s trunk passed +1 compile 1m 20s trunk passed with JDK v1.8.0_92 +1 compile 1m 22s trunk passed with JDK v1.7.0_95 +1 checkstyle 0m 30s trunk passed +1 mvnsite 1m 32s trunk passed +1 mvneclipse 0m 30s trunk passed +1 findbugs 3m 39s trunk passed +1 javadoc 1m 29s trunk passed with JDK v1.8.0_92 +1 javadoc 2m 13s trunk passed with JDK v1.7.0_95 0 mvndep 0m 9s Maven dependency ordering for patch +1 mvninstall 1m 20s the patch passed +1 compile 1m 15s the patch passed with JDK v1.8.0_92 +1 javac 1m 15s the patch passed +1 compile 1m 21s the patch passed with JDK v1.7.0_95 +1 javac 1m 21s the patch passed +1 checkstyle 0m 27s hadoop-hdfs-project: patch generated 0 new + 211 unchanged - 5 fixed = 211 total (was 216) +1 mvnsite 1m 24s the patch passed +1 mvneclipse 0m 22s the patch passed +1 whitespace 0m 0s Patch has no whitespace issues. +1 findbugs 4m 5s the patch passed +1 javadoc 1m 23s the patch passed with JDK v1.8.0_92 +1 javadoc 2m 8s the patch passed with JDK v1.7.0_95 +1 unit 0m 53s hadoop-hdfs-client in the patch passed with JDK v1.8.0_92. -1 unit 69m 41s hadoop-hdfs in the patch failed with JDK v1.8.0_92. +1 unit 0m 59s hadoop-hdfs-client in the patch passed with JDK v1.7.0_95. -1 unit 74m 6s hadoop-hdfs in the patch failed with JDK v1.7.0_95. +1 asflicense 0m 22s Patch does not generate ASF License warnings. 182m 45s Reason Tests JDK v1.8.0_92 Failed junit tests hadoop.hdfs.TestHFlush   hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl JDK v1.7.0_95 Failed junit tests hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl Subsystem Report/Notes Docker Image:yetus/hadoop:fbe3e86 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12800710/HADOOP-10220.005.patch JIRA Issue HDFS-10220 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 35daba647b3c 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / a5fed8b Default Java 1.7.0_95 Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_92 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95 findbugs v3.0.0 unit https://builds.apache.org/job/PreCommit-HDFS-Build/15288/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_92.txt unit https://builds.apache.org/job/PreCommit-HDFS-Build/15288/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/15288/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_92.txt https://builds.apache.org/job/PreCommit-HDFS-Build/15288/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/15288/testReport/ modules C: hadoop-hdfs-project/hadoop-hdfs-client hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project Console output https://builds.apache.org/job/PreCommit-HDFS-Build/15288/console Powered by Apache Yetus 0.2.0 http://yetus.apache.org This message was automatically generated.
        walter.k.su Walter Su added a comment -

        Thanks Nicolas Fraison for the update.
        To repeat one of my previous comments: isMaxLockHoldToReleaseLease doesn't need to be a function. Is it because it's called twice per iteration? I think checking it once per iteration would be enough.

        nfraison.criteo Nicolas Fraison added a comment -

        Walter Su, I will also repeat my feedback on that comment: I think it adds some readability, and it is also used twice.
        I prefer to check it twice as I don't know how many paths can be stored in a lease, and I would like to avoid getting stuck releasing the paths of a single lease.
        If you are sure that the number of paths per lease can't be a blocking point, we can change this and only check it once per iteration.

        raviprak Ravi Prakash added a comment -

        Thanks Nicolas! One nit: in MAX_LOCK_HOLD_TO_RELEASE_LAESE_MS you've misspelt LEASE. Could you please fix it?

        If there are no more objections, I'll commit the patch once you fix the nit.

        walter.k.su Walter Su added a comment -

        I think it adds some readability, and it is also used twice.

        I only took a peek last time. Yeah, I'm OK with that.
        Another problem I noticed while going through the details:

            while (!sortedLeases.isEmpty() && sortedLeases.peek().expiredHardLimit()
                && !isMaxLockHoldToReleaseLease(start)) {
              Lease leaseToCheck = sortedLeases.poll();
              ...
              Collection<Long> files = leaseToCheck.getFiles();
              ...
              for (Long id : leaseINodeIds) {
                ...
                } finally {
                  filesLeasesChecked++;
                  if (isMaxLockHoldToReleaseLease(start)) {
                    LOG.debug("Breaking out of checkLeases() after " +
                        filesLeasesChecked + " file leases checked.");
                    break;
                  }
                }
              }
            }

        You can't just break out of the inner for-loop: the leaseToCheck has already been polled out of the queue. This will cause some files to never be closed.

        raviprak Ravi Prakash added a comment -

        Good catch Walter! That seems broken prior to this patch too, doesn't it? Haohui Mai, could you please comment on whether leaseToCheck should be put back into sortedLeases in case completed is false? Otherwise, the lease will never be checked again. Or perhaps I am not understanding some other mechanism through which it would be.

        nfraison.criteo Nicolas Fraison added a comment -

        Thanks Walter Su for the catch.
        We can update the break like this to avoid the issue:

                } finally {
                  if (isMaxLockHoldToReleaseLease(start)) {
                    LOG.debug("Breaking out of checkLeases() after " +
                      maxLockHoldToReleaseLease + "ms.");
                    if (leaseToCheck.hasFiles()) {
                      renewLease(leaseToCheck);
                    }
                    break;
                  }
                }
        
        raviprak Ravi Prakash added a comment -

        Is there a need to renew the lease? That would prevent it from being checked for another LEASE_HARDLIMIT_PERIOD. Could we simply change Lease leaseToCheck = sortedLeases.poll(); to Lease leaseToCheck = sortedLeases.peek();? And add it to removing in case completed is true. I'd also propose to remove the removing.add(id); from the catch block. Haohui Mai, Jing Zhao, you made these changes in HDFS-6757. Do you have any comments?
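
        To make the peek-based idea concrete, here is a toy, self-contained sketch (PeekLoopSketch, the Long "leases", and tryRelease() are all made up for illustration and stand in for real Lease objects and FSNamesystem.internalReleaseLease(); this is not the LeaseManager code):

            import java.util.PriorityQueue;

            /** Toy illustration of the peek-vs-poll discussion; not NameNode code. */
            public class PeekLoopSketch {
              public static void main(String[] args) {
                PriorityQueue<Long> sortedLeases = new PriorityQueue<>();
                for (long i = 0; i < 1000; i++) {
                  sortedLeases.add(i);
                }
                final long start = System.currentTimeMillis();
                final long maxLockHoldMs = 5; // hypothetical budget, plays the role of isMaxLockHoldToReleaseLease()

                while (!sortedLeases.isEmpty()
                    && System.currentTimeMillis() - start <= maxLockHoldMs) {
                  Long leaseToCheck = sortedLeases.peek(); // peek: the lease stays queued if the budget runs out
                  boolean completed = tryRelease(leaseToCheck);
                  if (completed) {
                    sortedLeases.remove(leaseToCheck); // drop the lease only once fully processed
                  } else {
                    break; // don't spin on the same lease; retry it on the next monitor cycle
                  }
                }
                System.out.println("Leases still queued: " + sortedLeases.size());
              }

              private static boolean tryRelease(Long lease) {
                return true; // pretend every release succeeds in this toy
              }
            }

        With peek(), bailing out early leaves the unfinished lease at the head of the queue, so the next monitor cycle picks it up again instead of silently dropping it.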

        nfraison.criteo Nicolas Fraison added a comment -

        Ravi Prakash, in fact changing from Lease leaseToCheck = sortedLeases.poll(); to Lease leaseToCheck = sortedLeases.peek(); seems better.
        But I don't think we need to add the lease to removing when completed is true, as it has already been removed by fsnamesystem.internalReleaseLease in finalizeINodeFileUnderConstruction.
        I also think that we should keep the removing.add(id); in case of an exception, as these exceptions seem to be raised (and the leases removed) for good reason:

              // Only the last and the penultimate blocks may be in non COMPLETE state.
              // If the penultimate block is not COMPLETE, then it must be COMMITTED.
              .... 
              throw new IOException(message);
        

        and

              // Cannot close file right now, since some blocks 
              // are not yet minimally replicated.
              // This may potentially cause infinite loop in lease recovery
              // if there are no valid replicas on data-nodes.
              ...
              throw new AlreadyBeingCreatedException(message);
        

        What is your feeling about this?

        raviprak Ravi Prakash added a comment -

        Thanks Nicolas! You're amazing.
        I think I would like to hear opinions from some of the other people like Kihwal Lee and Yongjun Zhang. I'd be wary of throwing away leases for which recovery did not successfully finish.

        liuml07 Mingliang Liu added a comment -

        I also think changing from Lease leaseToCheck = sortedLeases.poll(); to Lease leaseToCheck = sortedLeases.peek(); will address Walter Su's comment. Moreover, we can move the statements in the finally block out of it (instead, put them after the try-catch). I'm not in favor of "breaking" an upper-level loop from a finally block, and I was hinted at this by ERR04-J.
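
        As a side note, here is a tiny self-contained demo of the behaviour ERR04-J warns about (FinallyBreakDemo is a made-up class for illustration): completing a finally block abruptly with break silently discards any in-flight exception.

            public class FinallyBreakDemo {
              public static void main(String[] args) {
                for (int i = 0; i < 1; i++) {
                  try {
                    throw new RuntimeException("this exception is never seen");
                  } finally {
                    break; // abrupt completion: the RuntimeException vanishes here
                  }
                }
                System.out.println("reached without any exception propagating");
              }
            }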

        Other than this, I have some nits:

        1. isMaxLockHoldToReleaseLease can be private
        2. In the test, given the assertEquals(expected, actual) signature, the arguments should be swapped to avoid a confusing failure message:
          - assertEquals(lm.countLease(), numLease);
          + assertEquals(numLease, lm.countLease());
        3. We may still need the javadoc for MAX_LOCK_HOLD_TO_RELEASE_LAESE_MS
        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 11s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
        0 mvndep 0m 25s Maven dependency ordering for branch
        +1 mvninstall 6m 47s trunk passed
        +1 compile 1m 14s trunk passed with JDK v1.8.0_91
        +1 compile 1m 19s trunk passed with JDK v1.7.0_95
        +1 checkstyle 0m 27s trunk passed
        +1 mvnsite 1m 24s trunk passed
        +1 mvneclipse 0m 26s trunk passed
        +1 findbugs 3m 31s trunk passed
        +1 javadoc 1m 25s trunk passed with JDK v1.8.0_91
        +1 javadoc 2m 13s trunk passed with JDK v1.7.0_95
        0 mvndep 0m 9s Maven dependency ordering for patch
        +1 mvninstall 1m 15s the patch passed
        +1 compile 1m 17s the patch passed with JDK v1.8.0_91
        +1 javac 1m 17s the patch passed
        +1 compile 1m 19s the patch passed with JDK v1.7.0_95
        +1 javac 1m 19s the patch passed
        -1 checkstyle 0m 28s hadoop-hdfs-project: patch generated 2 new + 212 unchanged - 4 fixed = 214 total (was 216)
        +1 mvnsite 1m 25s the patch passed
        +1 mvneclipse 0m 23s the patch passed
        +1 whitespace 0m 0s Patch has no whitespace issues.
        +1 findbugs 4m 13s the patch passed
        +1 javadoc 1m 25s the patch passed with JDK v1.8.0_91
        +1 javadoc 2m 7s the patch passed with JDK v1.7.0_95
        +1 unit 0m 50s hadoop-hdfs-client in the patch passed with JDK v1.8.0_91.
        -1 unit 60m 50s hadoop-hdfs in the patch failed with JDK v1.8.0_91.
        +1 unit 0m 57s hadoop-hdfs-client in the patch passed with JDK v1.7.0_95.
        +1 unit 56m 13s hadoop-hdfs in the patch passed with JDK v1.7.0_95.
        +1 asflicense 0m 24s Patch does not generate ASF License warnings.
        154m 53s



        Reason Tests
        JDK v1.8.0_91 Failed junit tests hadoop.hdfs.server.namenode.ha.TestDFSUpgradeWithHA
          hadoop.hdfs.TestDFSClientRetries
          hadoop.hdfs.server.namenode.TestEditLog
          hadoop.hdfs.TestCrcCorruption
          hadoop.hdfs.TestFileAppend



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:cf2ee45
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12802667/HADOOP-10220.006.patch
        JIRA Issue HDFS-10220
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 88376d2f4833 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 2835f14
        Default Java 1.7.0_95
        Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_91 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
        findbugs v3.0.0
        checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/15385/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project.txt
        unit https://builds.apache.org/job/PreCommit-HDFS-Build/15385/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_91.txt
        unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/15385/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_91.txt
        JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/15385/testReport/
        modules C: hadoop-hdfs-project/hadoop-hdfs-client hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project
        Console output https://builds.apache.org/job/PreCommit-HDFS-Build/15385/console
        Powered by Apache Yetus 0.2.0 http://yetus.apache.org

        This message was automatically generated.

        raviprak Ravi Prakash added a comment -

        Thanks for the patches Nicolas and the reviews Mingliang Liu and Walter Su!

        The latest patch looks good to me. If there are no objections, I will commit it by EOD today.

        raviprak Ravi Prakash added a comment -

        Wait, I take that back. Mingliang's comment about swapping the order of (lm.countLease(), numLeases) hasn't been addressed. Also, we are still removing the lease when FSNamesystem.internalReleaseLease throws an IOException.

        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 11s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
        0 mvndep 0m 10s Maven dependency ordering for branch
        +1 mvninstall 6m 41s trunk passed
        +1 compile 1m 15s trunk passed with JDK v1.8.0_91
        +1 compile 1m 20s trunk passed with JDK v1.7.0_95
        +1 checkstyle 0m 29s trunk passed
        +1 mvnsite 1m 28s trunk passed
        +1 mvneclipse 0m 25s trunk passed
        +1 findbugs 3m 39s trunk passed
        +1 javadoc 1m 37s trunk passed with JDK v1.8.0_91
        +1 javadoc 2m 19s trunk passed with JDK v1.7.0_95
        0 mvndep 0m 9s Maven dependency ordering for patch
        +1 mvninstall 1m 21s the patch passed
        +1 compile 1m 46s the patch passed with JDK v1.8.0_91
        +1 javac 1m 46s the patch passed
        +1 compile 1m 20s the patch passed with JDK v1.7.0_95
        +1 javac 1m 20s the patch passed
        +1 checkstyle 0m 26s hadoop-hdfs-project: patch generated 0 new + 212 unchanged - 4 fixed = 212 total (was 216)
        +1 mvnsite 1m 18s the patch passed
        +1 mvneclipse 0m 22s the patch passed
        +1 whitespace 0m 0s Patch has no whitespace issues.
        +1 findbugs 4m 2s the patch passed
        +1 javadoc 1m 21s the patch passed with JDK v1.8.0_91
        +1 javadoc 2m 7s the patch passed with JDK v1.7.0_95
        +1 unit 0m 50s hadoop-hdfs-client in the patch passed with JDK v1.8.0_91.
        -1 unit 63m 38s hadoop-hdfs in the patch failed with JDK v1.8.0_91.
        +1 unit 0m 58s hadoop-hdfs-client in the patch passed with JDK v1.7.0_95.
        -1 unit 57m 50s hadoop-hdfs in the patch failed with JDK v1.7.0_95.
        +1 asflicense 0m 25s Patch does not generate ASF License warnings.
        159m 42s



        Reason Tests
        JDK v1.8.0_91 Failed junit tests hadoop.hdfs.TestAsyncDFSRename
          hadoop.hdfs.server.datanode.TestNNHandlesBlockReportPerStorage
          hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl
        JDK v1.7.0_95 Failed junit tests hadoop.hdfs.TestAsyncDFSRename
          hadoop.hdfs.server.datanode.TestDataNodeMultipleRegistrations



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:cf2ee45
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12802705/HADOOP-10220.006.patch
        JIRA Issue HDFS-10220
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 461bc10ce9fc 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 2835f14
        Default Java 1.7.0_95
        Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_91 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
        findbugs v3.0.0
        unit https://builds.apache.org/job/PreCommit-HDFS-Build/15388/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_91.txt
        unit https://builds.apache.org/job/PreCommit-HDFS-Build/15388/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
        unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/15388/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_91.txt https://builds.apache.org/job/PreCommit-HDFS-Build/15388/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
        JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/15388/testReport/
        modules C: hadoop-hdfs-project/hadoop-hdfs-client hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project
        Console output https://builds.apache.org/job/PreCommit-HDFS-Build/15388/console
        Powered by Apache Yetus 0.2.0 http://yetus.apache.org

        This message was automatically generated.

        nfraison.criteo Nicolas Fraison added a comment -

        Ravi Prakash, as I said in my previous comment, I really think that we should still release the lease in case of an IOException. It seems that each time we get this exception, or an exception extending it, we don't want to keep the lease.
        I'll let you and the people who made the change double-check this.

        liuml07 Mingliang Liu added a comment -

        Hi Nicolas Fraison, you don't have to change the jira status every time you upload a new patch. Just upload it.

        nfraison.criteo Nicolas Fraison added a comment -

        Mingliang Liu, thanks for the notification. The Hadoop contribution documentation I found (https://wiki.apache.org/hadoop/HowToContribute#Contributing_your_work) was not clear about it:

        Should your patch receive a "-1" from the Jenkins testing, select the Cancel Patch on the issue's Jira, upload a new patch with necessary fixes, and then select the Submit Patch link again. 
        
        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 13s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
        0 mvndep 0m 10s Maven dependency ordering for branch
        +1 mvninstall 7m 20s trunk passed
        +1 compile 1m 23s trunk passed with JDK v1.8.0_91
        +1 compile 1m 25s trunk passed with JDK v1.7.0_95
        +1 checkstyle 0m 30s trunk passed
        +1 mvnsite 1m 33s trunk passed
        +1 mvneclipse 0m 26s trunk passed
        +1 findbugs 3m 46s trunk passed
        +1 javadoc 1m 31s trunk passed with JDK v1.8.0_91
        +1 javadoc 2m 19s trunk passed with JDK v1.7.0_95
        0 mvndep 0m 9s Maven dependency ordering for patch
        +1 mvninstall 1m 25s the patch passed
        +1 compile 1m 23s the patch passed with JDK v1.8.0_91
        +1 javac 1m 23s the patch passed
        +1 compile 1m 27s the patch passed with JDK v1.7.0_95
        +1 javac 1m 27s the patch passed
        +1 checkstyle 0m 27s hadoop-hdfs-project: patch generated 0 new + 212 unchanged - 4 fixed = 212 total (was 216)
        +1 mvnsite 1m 23s the patch passed
        +1 mvneclipse 0m 23s the patch passed
        +1 whitespace 0m 0s Patch has no whitespace issues.
        +1 findbugs 4m 22s the patch passed
        +1 javadoc 1m 27s the patch passed with JDK v1.8.0_91
        +1 javadoc 2m 5s the patch passed with JDK v1.7.0_95
        +1 unit 0m 49s hadoop-hdfs-client in the patch passed with JDK v1.8.0_91.
        -1 unit 57m 17s hadoop-hdfs in the patch failed with JDK v1.8.0_91.
        +1 unit 0m 58s hadoop-hdfs-client in the patch passed with JDK v1.7.0_95.
        -1 unit 0m 29s hadoop-hdfs in the patch failed with JDK v1.7.0_95.
        +1 asflicense 0m 20s Patch does not generate ASF License warnings.
        97m 24s



        Reason Tests
        JDK v1.8.0_91 Failed junit tests hadoop.hdfs.server.balancer.TestBalancer
          hadoop.hdfs.server.datanode.TestDataNodeLifeline



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:cf2ee45
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12802739/HADOOP-10220.006.patch
        JIRA Issue HDFS-10220
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux cef391b6ec1d 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / b2ed6ae
        Default Java 1.7.0_95
        Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_91 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
        findbugs v3.0.0
        unit https://builds.apache.org/job/PreCommit-HDFS-Build/15389/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_91.txt
        unit https://builds.apache.org/job/PreCommit-HDFS-Build/15389/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
        unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/15389/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_91.txt
        JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/15389/testReport/
        modules C: hadoop-hdfs-project/hadoop-hdfs-client hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project
        Console output https://builds.apache.org/job/PreCommit-HDFS-Build/15389/console
        Powered by Apache Yetus 0.2.0 http://yetus.apache.org

        This message was automatically generated.

        walter.k.su Walter Su added a comment -

        The last patch looks pretty good. +1 once the test nits get addressed. Thanks Nicolas Fraison for the contribution. Thanks Ravi Prakash and Mingliang Liu for the good advice and review.

        kihwal Kihwal Lee added a comment -

        Sorry for being late to the party. 200ms of write locking is an eternity. Since a small delay in releasing leases is not critical and the check runs roughly every 2 seconds, can we reduce the limit to something shorter? How long did it take to release 5M in your case? Was logging inside the write lock slowing things down significantly? Since the limit is not configurable, we should make sure the fixed value is reasonable. The rest of the patch looks good.

        nfraison.criteo Nicolas Fraison added a comment -

        Kihwal Lee, it takes around 20s to release 100K paths, so with a 200ms lock we will release around 1K paths. We could reduce it to 100ms; it would then take around 6H to release 5M paths (taking into account the run every 2s).
        The logging surely has an impact, but I don't have any metrics to assert that it slows things down significantly.
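
        For reference, the back-of-the-envelope numbers above work out roughly as follows (a sketch assuming the release rate stays linear at ~5 paths/ms):

            100K paths / 20s                  ≈ 5 paths per ms
            5 paths/ms × 200ms lock           ≈ 1K paths per cycle
            5M paths / 1K per cycle × ~2.2s   ≈ 11,000s ≈ 3H  (200ms limit)
            5M paths / 500 per cycle × ~2.1s  ≈ 21,000s ≈ 6H  (100ms limit)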

        kihwal Kihwal Lee added a comment - edited

        The throughput is pathetic, but it seems in the ballpark of what I have seen. In my experience, the commitBlockSynchronization() load generated by lease recovery also affects performance greatly. The lease recovery may fill up the edit buffer and cause auto-sync. Depending on the speed of edit syncing, a massive lease recovery can overwhelm the edit buffering and I/O. It would be nice to find out what the actual bottleneck is, so that we can improve the performance.

        The average rpc time of commitBlockSynchronization() I observed lately is around 600us. After releasing 1K paths, there can be 1K commitBlockSynchronization() calls in the worst case. That will translate to 600ms, so cumulatively about 800ms will be spent in the namespace write lock. Since the lease manager sleeps for 2 seconds, the NN will be spending about 0.8/2.2 = 36% of time exclusively on lease/block recovery.

        This might only be acceptable on lightly loaded namenodes. Setting the limit to 100ms will lower it to 24%, and 50ms will make it 12%. We also have to weigh how important these lease recoveries are. I don't think this kind of mass lease recovery is normal. These are usually caused by faulty user code (e.g. not closing files before committing). This should not penalize other users by greatly degrading NN performance. So I lean toward something like 50ms or shorter.

        I want to hear what others think.

        daryn Daryn Sharp added a comment -

        IMHO lease recovery is not critical enough to warrant an impact on throughput. The callq on large busy clusters (ex. ~30k ops/sec) will easily fill or overflow. I'd rather see it be more on the order of single-digit ms. At the end of a cycle, if it broke out early, then perhaps it could sleep for less than 2s.

        nfraison.criteo Nicolas Fraison added a comment -

        What do you think about setting the timeout to 5ms and reducing the sleep to 500ms if it broke out early?
        With this setting the NN would spend around 4% of its time on lease/block recovery.
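
        A rough sketch of where the ~4% figure comes from (assuming the ~5 paths/ms release rate and the ~600us commitBlockSynchronization() cost quoted earlier in this thread):

            5ms hold ≈ 25 paths released per cycle
            25 paths × 600us of follow-up commitBlockSynchronization() ≈ 15ms
            (5ms + 15ms) / (500ms sleep + 20ms work) ≈ 4% of time on lease/block recovery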

        yzhangal Yongjun Zhang added a comment -

        I happened to see this JIRA just now.

        Hi Daryn Sharp, do you suggest dynamically adjusting the lease check interval by saying "if it broke out early then perhaps it could sleep for less than 2s"?

        I agree with Kihwal Lee's "I don't think this kind of mass lease recoveries are normal" comment. I wonder if we could just make both MAX_LOCK_HOLD_TO_RELEASE_LEASE_MS and the lease check interval config parameters instead of fixed numbers. If this config is there, it can solve the abnormal situation. I know we have many configs already though. What do you guys think?

        Thanks.

        nfraison.criteo Nicolas Fraison added a comment -

        Yongjun Zhang, this is how I understood Daryn Sharp's comment: run the release every 2s for at most 5ms, and if the release is not finished after 5ms, reduce the sleep to 500ms before the next release. Once a release finishes within 5ms, go back to the 2s sleep.
        I agree that this doesn't happen under normal behaviour, but on our two quite big clusters (40PB), with around 100k jobs running every day on each and lots of distcp syncing data between them, I sometimes see (2 to 4 times a month) quite big releases of paths (around 200K to 400K) that could affect the HDFS layer (we have applied this patch, so it is not the case for us). Even if the issues are always due to something bad in the running jobs, I think the namenode should be protective here to avoid affecting the whole cluster.
        In the first patch this was a configurable parameter, but there was a consensus to move it to a constant. So I would like feedback from others on this, to avoid doing and redoing the change.

        Thanks for your feedback

        raviprak Ravi Prakash added a comment -

        I think this probably happens rarely, so having dynamic duty cycles may be overkill IMHO. If you can make MAX_LOCK_HOLD_TO_RELEASE_LEASE_MS and the lease check interval config parameters, I'll +1 and commit. We've vacillated on this long enough.
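
        For reference, the two resulting settings would be tuned in hdfs-site.xml along these lines (the property names match the committed patch; the values and descriptions here are illustrative, not necessarily the shipped defaults):

            <property>
              <name>dfs.namenode.lease-recheck-interval-ms</name>
              <value>2000</value>
              <description>Illustrative: how often the LeaseManager.Monitor wakes up
              to check for expired leases.</description>
            </property>
            <property>
              <name>dfs.namenode.max-lock-hold-to-release-lease-ms</name>
              <value>25</value>
              <description>Illustrative: the longest the namesystem write lock may be
              held per cycle while releasing leases.</description>
            </property>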

        nfraison.criteo Nicolas Fraison added a comment -

        It seems that patch HADOOP-10220.007.patch, provided a week ago, has not been automatically tested.
        Is there a way to force a test?

        arpitagarwal Arpit Agarwal added a comment -

        Started https://builds.apache.org/view/PreCommit%20Builds/job/PreCommit-HDFS-Build/15706/
        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 24s Docker mode activated.
        +1 @author 0m 1s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
        +1 mvninstall 6m 2s trunk passed
        +1 compile 0m 43s trunk passed
        +1 checkstyle 0m 33s trunk passed
        +1 mvnsite 1m 3s trunk passed
        +1 mvneclipse 0m 10s trunk passed
        +1 findbugs 1m 43s trunk passed
        +1 javadoc 1m 5s trunk passed
        +1 mvninstall 0m 52s the patch passed
        +1 compile 0m 58s the patch passed
        +1 javac 0m 58s the patch passed
        -1 checkstyle 0m 36s hadoop-hdfs-project/hadoop-hdfs: The patch generated 4 new + 622 unchanged - 5 fixed = 626 total (was 627)
        +1 mvnsite 1m 8s the patch passed
        +1 mvneclipse 0m 12s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 xml 0m 1s The patch has no ill-formed XML file.
        +1 findbugs 2m 13s the patch passed
        +1 javadoc 1m 4s the patch passed
        -1 unit 78m 27s hadoop-hdfs in the patch failed.
        +1 asflicense 0m 21s The patch does not generate ASF License warnings.
        99m 5s



        Reason Tests
        Failed junit tests hadoop.hdfs.server.namenode.TestReconstructStripedBlocks



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:2c91fd8
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12807364/HADOOP-10220.007.patch
        JIRA Issue HDFS-10220
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle xml
        uname Linux 989be55ed1a7 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 3344ba7
        Default Java 1.8.0_91
        findbugs v3.0.0
        checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/15706/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt
        unit https://builds.apache.org/job/PreCommit-HDFS-Build/15706/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
        unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/15706/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
        Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/15706/testReport/
        modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
        Console output https://builds.apache.org/job/PreCommit-HDFS-Build/15706/console
        Powered by Apache Yetus 0.3.0 http://yetus.apache.org

        This message was automatically generated.

        kihwal Kihwal Lee added a comment -

        The patch looks fine to me. Ravi Prakash, how about you?

        raviprak Ravi Prakash added a comment -

        Awesome! Patch looks good to me too. Will commit shortly. Thanks for the patch Nicolas, and for the reviews Kihwal, Arpit, Yongjun, Daryn, Walter, Mingliang and Vinaykumar!

        raviprak Ravi Prakash added a comment -

        I've committed this to trunk and branch-2. Please let me know if I should put this in branch-2.8 as well.

        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-trunk-Commit #9933 (See https://builds.apache.org/job/Hadoop-trunk-Commit/9933/)
        HDFS-10220. A large number of expired leases can make namenode (raviprak: rev ae047655f4355288406cd5396fb4e3ea7c307b14)

        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/LeaseManager.java
        • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestLeaseManager.java
        • hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/common/HdfsServerConstants.java
        zhz Zhe Zhang added a comment -

        Nicolas Fraison, Ravi Prakash, I think this is a valid fix for 2.8 through 2.6. Let me know if you have time to backport; I'm happy to do it too. Thanks.

        raviprak Ravi Prakash added a comment -

        I've merged this into branch-2.8. However, there were conflicts for branch-2.7.

        zhz Zhe Zhang added a comment -

        Agreed. HDFS-6757 made a pretty big change which is hard to get around. Thanks Ravi for backporting to 2.7.

        liuml07 Mingliang Liu added a comment -

        Thanks for the explanation. I think the wiki page is stale and we should fix this after confirmation.


          People

          • Assignee:
            nfraison.criteo Nicolas Fraison
            Reporter:
            nfraison.criteo Nicolas Fraison
          • Votes:
            0 Vote for this issue
            Watchers:
            14 Start watching this issue
