Hadoop HDFS / HDFS-1476

listCorruptFileBlocks should be functional while the name node is still in safe mode

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.23.0
    • Fix Version/s: 0.23.0
    • Component/s: namenode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      This would allow us to detect whether missing blocks can be fixed using Raid and if that is the case exit safe mode earlier.

      One way to make listCorruptFileBlocks available before the name node has exited safe mode would be to scan the blocks map on each call to listCorruptFileBlocks to determine whether there are any blocks with no replicas. This scan could be parallelized by dividing the space of block IDs into multiple intervals that can be scanned independently.
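A minimal sketch of the interval-based scan proposed above, not the code from any attached patch. It assumes the blocks map can be modelled as a sorted map from block ID to replica count; the real NameNode blocks map is not a sorted map, and `ParallelBlockScan` and `findBlocksWithoutReplicas` are hypothetical names chosen for illustration.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for the blocks map: block ID -> number of reported replicas.
class ParallelBlockScan {

  /** Splits the block ID range into numIntervals pieces, scans them in parallel,
   *  and returns the IDs of blocks that have zero replicas, in ascending order. */
  static List<Long> findBlocksWithoutReplicas(
      ConcurrentSkipListMap<Long, Integer> replicaCounts, int numIntervals) {
    List<Long> missing = new ArrayList<>();
    if (replicaCounts.isEmpty() || numIntervals < 1) {
      return missing;
    }
    // BigInteger avoids overflow when block IDs span the whole long range.
    BigInteger min = BigInteger.valueOf(replicaCounts.firstKey());
    BigInteger max = BigInteger.valueOf(replicaCounts.lastKey());
    BigInteger span = max.subtract(min).add(BigInteger.ONE);
    BigInteger n = BigInteger.valueOf(numIntervals);

    ExecutorService pool = Executors.newFixedThreadPool(numIntervals);
    try {
      List<Future<List<Long>>> results = new ArrayList<>();
      for (int i = 0; i < numIntervals; i++) {
        BigInteger lo = min.add(span.multiply(BigInteger.valueOf(i)).divide(n));
        BigInteger next = min.add(span.multiply(BigInteger.valueOf(i + 1)).divide(n));
        if (lo.equals(next)) {
          continue; // empty interval (more intervals than distinct IDs)
        }
        long from = lo.longValueExact();
        long to = next.subtract(BigInteger.ONE).longValueExact();
        // Each task scans one closed interval [from, to] independently.
        results.add(pool.submit(() -> {
          List<Long> found = new ArrayList<>();
          for (Map.Entry<Long, Integer> e
              : replicaCounts.subMap(from, true, to, true).entrySet()) {
            if (e.getValue() == 0) {
              found.add(e.getKey());
            }
          }
          return found;
        }));
      }
      for (Future<List<Long>> f : results) {
        missing.addAll(f.get());
      }
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException("block scan failed", e);
    } finally {
      pool.shutdown();
    }
    return missing;
  }
}
```

The number of intervals is a parameter here because, as discussed later in this issue, the useful degree of parallelism depends on the machine and on the size of the blocks map.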

      1. HDFS-1476.2.patch
        16 kB
        Patrick Kling
      2. HDFS-1476.3.patch
        16 kB
        Patrick Kling
      3. HDFS-1476.4.patch
        16 kB
        Patrick Kling
      4. HDFS-1476.5.patch
        17 kB
        Patrick Kling
      5. HDFS-1476.patch
        16 kB
        Patrick Kling

          Activity

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #968 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/968/)
          HDFS-2826. Add test case for HDFS-1476 (safemode can initialize replication queues before exiting) (todd)

          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1235067
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestSafeMode.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/NameNodeAdapter.java
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-0.23-Build #170 (See https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Build/170/)
          HDFS-2826. Add test case for HDFS-1476 (safemode can initialize replication queues before exiting) (todd)

          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1235068
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestSafeMode.java
          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/NameNodeAdapter.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-0.23-Build #148 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/148/)
          HDFS-2826. Add test case for HDFS-1476 (safemode can initialize replication queues before exiting) (todd)

          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1235068
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestSafeMode.java
          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/NameNodeAdapter.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #935 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/935/)
          HDFS-2826. Add test case for HDFS-1476 (safemode can initialize replication queues before exiting) (todd)

          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1235067
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestSafeMode.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/NameNodeAdapter.java
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #1592 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/1592/)
          HDFS-2826. Add test case for HDFS-1476 (safemode can initialize replication queues before exiting) (todd)

          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1235067
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestSafeMode.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/NameNodeAdapter.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #1648 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1648/)
          HDFS-2826. Add test case for HDFS-1476 (safemode can initialize replication queues before exiting) (todd)

          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1235067
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestSafeMode.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/NameNodeAdapter.java
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-0.23-Commit #424 (See https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Commit/424/)
          HDFS-2826. Add test case for HDFS-1476 (safemode can initialize replication queues before exiting) (todd)

          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1235068
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestSafeMode.java
          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/NameNodeAdapter.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-0.23-Commit #399 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Commit/399/)
          HDFS-2826. Add test case for HDFS-1476 (safemode can initialize replication queues before exiting) (todd)

          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1235068
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestSafeMode.java
          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/NameNodeAdapter.java
          Hudson added a comment -

          Integrated in Hadoop-Common-0.23-Commit #409 (See https://builds.apache.org/job/Hadoop-Common-0.23-Commit/409/)
          HDFS-2826. Add test case for HDFS-1476 (safemode can initialize replication queues before exiting) (todd)

          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1235068
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestSafeMode.java
          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/NameNodeAdapter.java
          Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #1575 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/1575/)
          HDFS-2826. Add test case for HDFS-1476 (safemode can initialize replication queues before exiting) (todd)

          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1235067
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestSafeMode.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/NameNodeAdapter.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #643 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/643/)

          Hairong Kuang added a comment -

          I've just committed this. Thanks Patrick!

          Hairong Kuang added a comment -

          +1. This looks good to me.

          Patrick Kling added a comment -
          • Updated patch to apply to current trunk.
          • In BlockManager.markBlockAsCorrupt() only update needed replication queues if they have been initialized

          ant test-patch results:

               [exec] +1 overall.  
               [exec] 
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec] 
               [exec]     +1 tests included.  The patch appears to include 6 new or modified tests.
               [exec] 
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec] 
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec] 
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.
               [exec] 
               [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
               [exec] 
               [exec]     +1 system test framework.  The patch passed system test framework compile.
          

          ant test failures (same as on clean trunk):

           [junit] Test org.apache.hadoop.hdfs.server.datanode.TestBlockRecovery FAILED
              [junit] Test org.apache.hadoop.hdfs.TestHDFSServerPorts FAILED
              [junit] Test org.apache.hadoop.hdfs.TestHDFSTrash FAILED (timeout)
              [junit] Test org.apache.hadoop.hdfs.server.namenode.TestBackupNode FAILED
              [junit] Test org.apache.hadoop.hdfs.server.namenode.TestStorageRestore FAILED
              [junit] Test org.apache.hadoop.hdfs.TestFileConcurrentReader FAILED (timeout)
              [junit] Test org.apache.hadoop.hdfs.server.namenode.TestLargeDirectoryDelete FAILED (timeout)
              [junit] Test org.apache.hadoop.hdfs.server.datanode.TestBlockRecovery FAILED
          
          Patrick Kling added a comment -

          Updated test case to play nice with HDFS-1482.

          Hairong Kuang added a comment -

          I thought more about this. processMisReplicatedBlocks scans all blocks in the NN while holding the namesystem write lock. This blocks the processing of block reports for the duration of the scan. It would be nice if it could release the lock periodically.
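A rough sketch of what releasing the lock periodically could look like, assuming the scan runs over a fixed snapshot of block IDs. `BatchedScan`, `scanInBatches` and the batch size are hypothetical and not taken from the patch. A real implementation would also have to cope with the blocks map changing while the lock is released.

```java
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Consumer;

class BatchedScan {

  /** Processes every block ID, holding the write lock for at most batchSize
   *  blocks at a time. Returns how many times the lock was acquired. */
  static int scanInBatches(List<Long> blockIds, int batchSize,
      ReentrantReadWriteLock lock, Consumer<Long> processBlock) {
    if (batchSize < 1) {
      throw new IllegalArgumentException("batchSize must be >= 1");
    }
    int acquisitions = 0;
    int next = 0;
    while (next < blockIds.size()) {
      lock.writeLock().lock();
      acquisitions++;
      try {
        int end = next + Math.min(batchSize, blockIds.size() - next);
        for (; next < end; next++) {
          processBlock.accept(blockIds.get(next));
        }
      } finally {
        lock.writeLock().unlock();
      }
      // The lock is free here, so a waiting block report could be processed
      // before the next batch starts.
    }
    return acquisitions;
  }
}
```

A smaller batch size gives block reports more chances to run at the cost of more lock acquisitions.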

          Hairong Kuang added a comment -

          +1. Looks good to me. Using the safemode exiting threshold is fine with me too.

          Patrick Kling added a comment -

          Changed default value of replication queue threshold to safe mode threshold.

          Hairong Kuang added a comment -

          > the safemode exiting threshold
          I meant the default safemode exiting threshold.

          Hairong Kuang added a comment -

          I really like this idea. It is simple and does not need a lot of code change, but should work out very well. It has the extra bonus of making exiting safemode quicker.

          One minor comment: the default value of the threshold would be better set to the safemode exiting threshold.

          Patrick Kling added a comment -

          The findbugs warnings are caused by the problem described in MAPREDUCE-2172. Running ant test-patch with an empty patch yields 3 findbugs warnings and 97 release audit warnings, so this patch does not introduce any new problems.

          dhruba borthakur added a comment -

          The code looks good, +1.

          There seem to be 3 findbugs warnings. Can you fix them? Once done, I will commit it.

          Patrick Kling added a comment -
               [exec] -1 overall.  
               [exec] 
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec] 
               [exec]     +1 tests included.  The patch appears to include 6 new or modified tests.
               [exec] 
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec] 
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec] 
               [exec]     -1 findbugs.  The patch appears to introduce 3 new Findbugs warnings.
               [exec] 
               [exec]     -1 release audit.  The applied patch generated 97 release audit warnings (more than the trunk's current 1 warnings).
               [exec] 
               [exec]     +1 system test framework.  The patch passed system test framework compile.
          

          The findbugs/release audit warnings are caused by the issue described in MAPREDUCE-2172.

          Patrick Kling added a comment -

          Incorporated Dhruba's feedback on the review board. Thank you!

          Patrick Kling added a comment -

          This patch introduces a new configuration variable dfs.namenode.replqueue.threshold-pct that determines the fraction of blocks for which block reports have to be received before the NameNode will start initializing the needed replication queues. Once a sufficient number of block reports have been received, the queues are initialized while the NameNode is still in safe mode. After the queues are initialized, subsequent block reports are handled by updating the queues incrementally.

          The benefit of this is twofold:

          • It allows us to compute the replication queues while we are waiting for the last few block reports (when the NameNode is mostly idle). Once these block reports have been received, we can then immediately leave safe mode without having to wait for the computation of the needed replication queues (which requires a full traversal of the blocks map).
          • With Raid, it may not be necessary to stay in safe mode until all blocks have been reported. Using this change, we could monitor if all of the missing blocks can be recreated using parity information and if so leave safe mode early. In order for this monitoring to work, we need access to the needed replication queues while the NameNode is still in safe mode.

          The review board entry for this patch can be found at https://reviews.apache.org/r/105/ .
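The threshold behaviour described in this comment can be sketched roughly as follows. This is an illustration, not the patch itself: the class, field and method names are hypothetical, and the threshold is assumed to be a fraction between 0 and 1, matching the "fraction of blocks" wording above.

```java
// Hypothetical model of the safe-mode bookkeeping; not the actual FSNamesystem code.
class ReplQueueThreshold {
  private final long replQueueThresholdBlocks; // blocks needed before queues are built
  private long blockSafe;                      // blocks reported so far
  private boolean replQueuesInitialized;

  /** thresholdPct stands in for the value of dfs.namenode.replqueue.threshold-pct. */
  ReplQueueThreshold(long blockTotal, double thresholdPct) {
    this.replQueueThresholdBlocks = (long) (blockTotal * thresholdPct);
  }

  /** Records one newly reported block. Returns true only on the call that
   *  crosses the threshold, i.e. when the full queue initialization should run. */
  boolean reportSafeBlock() {
    blockSafe++;
    if (!replQueuesInitialized && blockSafe >= replQueueThresholdBlocks) {
      replQueuesInitialized = true; // one-time full traversal would happen here
      return true;
    }
    // After initialization, later reports would update the queues incrementally.
    return false;
  }

  boolean isInitialized() {
    return replQueuesInitialized;
  }
}
```

With 8 blocks and a threshold of 0.75, the sixth reported block triggers initialization, and the remaining two reports are handled incrementally while the NameNode is still in safe mode.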

          dhruba borthakur added a comment -

          Thinking more about this one, we can exit safemode faster if we can compute misReplicatedBlocks even before we have one replica of all blocks.

          Step 1: the namenode waits to ensure that there is at least one replica of all known blocks.
          Step 2: Then it invokes processMisReplicatedBlocks to update neededReplication.

          When the cluster restarts, the namenode starts in Step 1 and starts processing a storm of block reports from all datanodes. But a few datanodes are somewhat slow and the block report from the straggler datanodes delays the transition from Step 1 to Step 2. The CPU usage on the NN decreases exponentially as Step 1 progresses and becomes almost negligible when Step 1 is about to end.

          This jira could change the code so that processMisReplicatedBlocks is invoked before Step 1 finishes completely. This will make the NN exit safemode earlier.

          Dmytro Molkov added a comment -

          I think the right degree of parallelism here will depend on several characteristics: the number of cores in the system, the size of the blocks map, and maybe others. So it might make sense to make it configurable to begin with. Large clusters will certainly benefit from running multiple parallel threads, since the time to scan the full BlocksMap will be on the order of minutes.

          dhruba borthakur added a comment -

          We can first enhance FSNamesystem.listCorruptFileBlocks() to return a correct list of files even when the namenode is in safemode. It can make one pass over all the blocks in the BlocksMap and return only those blocks that do not have any replicas. We can then measure the time this RPC takes and, depending on its performance, decide whether to parallelize it.


            People

            • Assignee:
              Patrick Kling
              Reporter:
              Patrick Kling
            • Votes:
              0
              Watchers:
              5
