Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.21.0
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      TestFileAppend3 hangs because it fails to close the file. The following log snippet shows the cause of the problem:
      [junit] 2009-10-01 07:00:00,719 WARN hdfs.DFSClient (DFSClient.java:setupPipelineForAppendOrRecovery(3004)) - Error Recovery for block blk_-4098350497078465335_1007 in pipeline 127.0.0.1:58375, 127.0.0.1:36982: bad datanode 127.0.0.1:36982
      [junit] 2009-10-01 07:00:00,721 INFO datanode.DataNode (DataXceiver.java:opWriteBlock(224)) - Receiving block blk_-4098350497078465335_1007 src: /127.0.0.1:40252 dest: /127.0.0.1:58375
      [junit] 2009-10-01 07:00:00,721 INFO datanode.DataNode (FSDataset.java:recoverClose(1248)) - Recover failed close blk_-4098350497078465335_1007
      [junit] 2009-10-01 07:00:00,723 INFO datanode.DataNode (DataXceiver.java:opWriteBlock(369)) - Received block blk_-4098350497078465335_1008 src: /127.0.0.1:40252 dest: /127.0.0.1:58375 of size 65536
      [junit] 2009-10-01 07:00:00,724 INFO hdfs.StateChange (BlockManager.java:addStoredBlock(1006)) - BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_-4098350497078465335_1008 on 127.0.0.1:58375 size 65536 But it does not belong to any file.
      [junit] 2009-10-01 07:00:00,724 INFO namenode.FSNamesystem (FSNamesystem.java:updatePipeline(3946)) - updatePipeline(block=blk_-4098350497078465335_1007, newGenerationStamp=1008, newLength=65536, newNodes=[127.0.0.1:58375], clientName=DFSClient_995688145)

      1. hdfs-668.patch
        2 kB
        Konstantin Shvachko
      2. loop.patch
        8 kB
        Hairong Kuang
      3. loop1.patch
        8 kB
        Hairong Kuang

        Activity

        Hudson added a comment -

        Integrated in Hdfs-Patch-h5.grid.sp2.yahoo.net #78 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/78/)

        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #120 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/120/)

        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk-Commit #79 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/79/)

        Hudson added a comment -

        Integrated in Hdfs-Patch-h2.grid.sp2.yahoo.net #47 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/47/)

        Hairong Kuang added a comment -

        I just resolved this!

        Hairong Kuang added a comment -

        The two failed unit tests are both known ones.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12422400/loop1.patch
        against trunk revision 826086.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/73/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/73/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/73/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/73/console

        This message is automatically generated.

        Hairong Kuang added a comment -

        > Should we file a new jira to fix the problem in INodeFile.getLastBlock()?
        +1

        Tsz Wo Nicholas Sze added a comment -

        +1 patch looks good.

        Should we file a new jira to fix the problem in INodeFile.getLastBlock()?

        Hairong Kuang added a comment -

        Thanks for Nicholas' review comments. This patch incorporates the first two comments.

        For the last comment, there is a check that makes sure the last block of the file under construction is not null and is under construction.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12422310/loop.patch
        against trunk revision 825689.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/72/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/72/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/72/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/72/console

        This message is automatically generated.

        Tsz Wo Nicholas Sze added a comment -
        • In BlocksMap,
          +   * Update the old block with the new block.
          +   * 
          +   * The new block has a newer generation stamp so it requires remove
          +   * the old entry first and reinsert the new entry
          +   * 
          +   * @return the removed stored block in the map
          +   */
          +  BlockInfo updateBlock(Block oldBlock, Block newBlock) {
          +    BlockInfo blockInfo = map.remove(oldBlock);
          +    blockInfo.setGenerationStamp(newBlock.getGenerationStamp());
          +    blockInfo.setNumBytes(newBlock.getNumBytes());
          +    map.put(blockInfo, blockInfo);
          +    return blockInfo;
          +  }
          
          • It is better to check oldBlock.getBlockId() == newBlock.getBlockId() or change updateBlock(..) to updateBlock(Block b, long newGenerationStamp, long newLength) (a sketch of the latter follows below).
          • The stored block is added back. So the javadoc "@return the removed stored block in the map" sounds incorrect.
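
          A minimal sketch of the suggested alternative signature (hypothetical; it assumes the surrounding BlocksMap class and its map field from the quoted diff, and is not code from the patch):

            // Hypothetical sketch of the suggested signature: the block identity
            // is fixed, so a caller cannot pass a newBlock with a different ID.
            BlockInfo updateBlock(Block b, long newGenerationStamp, long newLength) {
              BlockInfo blockInfo = map.remove(b);   // detach the stored block
              if (blockInfo == null) {
                return null;                         // unknown block; nothing to update
              }
              blockInfo.setGenerationStamp(newGenerationStamp);
              blockInfo.setNumBytes(newLength);
              map.put(blockInfo, blockInfo);         // reinsert the updated entry
              return blockInfo;
            }
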
        • In FSNamesystem,
          @@ -1399,6 +1399,9 @@
                 //
                 for (BlockInfo block: v.getBlocks()) {
                   if (!blockManager.checkMinReplication(block)) {
          +          NameNode.stateChangeLog.info("BLOCK* NameSystem.checkFileProgress: "
          +              + "block " + block + " has not reached minimal replication "
          +              + blockManager.minReplication);
                     return false;
                   }
                 }
          @@ -1408,6 +1411,9 @@
                 //
                 BlockInfo b = v.getPenultimateBlock();
                 if (b != null && !blockManager.checkMinReplication(b)) {
          +        NameNode.stateChangeLog.info("BLOCK* NameSystem.checkFileProgress: "
          +            + "block " + b + " has not reached minimal replication "
          +            + blockManager.minReplication);
                   return false;
                 }
               }
          
          • These two log messages do not look like "state changes". Should we use FSNamesystem.LOG instead?
        • In FSNamesystem,
          -    final BlockInfo oldblockinfo = pendingFile.getLastBlock();
          +    final BlockInfoUnderConstruction blockinfo = pendingFile.getLastBlock();
          
          • Could blockinfo be null?
          • Is it the case that the last block must be a BlockInfoUnderConstruction? I am afraid that an IOException caused by a ClassCastException may be thrown by getLastBlock(). The existing code shown below looks incorrect: it first suppresses unchecked warnings and then converts a ClassCastException to an IOException. This makes it very hard to use. How can the caller handle such an IOException? (A possible alternative is sketched after the snippet below.)
            //INodeFile
              <T extends BlockInfo> T getLastBlock() throws IOException {
                if (blocks == null || blocks.length == 0)
                  return null;
                T returnBlock = null;
                try {
                  @SuppressWarnings("unchecked")  // ClassCastException is caught below
                  T tBlock = (T)blocks[blocks.length - 1];
                  returnBlock = tBlock;
                } catch(ClassCastException cce) {
                  throw new IOException("Unexpected last block type: " 
                      + blocks[blocks.length - 1].getClass().getSimpleName());
                }
                return returnBlock;
              }
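
          One possible alternative, as a hypothetical sketch (not part of the patch): return the raw BlockInfo and let the caller perform an explicit, checked cast, so no ClassCastException has to be converted into an IOException:

            // Hypothetical alternative: no generics, no unchecked cast.
            BlockInfo getLastBlock() {
              return (blocks == null || blocks.length == 0)
                  ? null : blocks[blocks.length - 1];
            }

            // Caller side, where the last block must be under construction:
            BlockInfo last = pendingFile.getLastBlock();
            if (last != null && !(last instanceof BlockInfoUnderConstruction)) {
              throw new IOException("Unexpected last block type: "
                  + last.getClass().getSimpleName());
            }
            BlockInfoUnderConstruction blockinfo = (BlockInfoUnderConstruction) last;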
            
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12422310/loop.patch
        against trunk revision 825689.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/68/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/68/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/68/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/68/console

        This message is automatically generated.

        Hairong Kuang added a comment -

        This patch fixes the bugs in both addStoredBlock and updatePipeline.

        Hairong Kuang added a comment -

        The above proposal turned out to be quite tricky to get right. So I instead chose to change only updatePipeline: the old stored block info is removed only from the hash map in BlocksMap, and after the GS etc. is updated, it is put back into the hash map (this is the updateBlock(..) snippet quoted in Nicholas' review above). This way all other properties of the block are preserved.

        Hairong Kuang added a comment -

        One solution to the above bug is to overwrite the first location list with the valid finalized replicas from the second list when an under-construction block is converted to a complete block (a rough sketch below).
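
        A rough, hypothetical sketch of this proposal (illustrative names only; not the fix that was ultimately committed):

          import java.util.List;

          class CompletionSketch {
            // On converting an under-construction block to a complete block,
            // overwrite the pipeline (first) location list with the finalized
            // (second) replica locations.
            static void convertToComplete(List<String> pipelineLocations,
                                          List<String> finalizedLocations) {
              pipelineLocations.clear();                     // drop stale pipeline entries
              pipelineLocations.addAll(finalizedLocations);  // keep only finalized replicas
            }
          }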

        Hairong Kuang added a comment -

        What triggered the file close problem is that the datanode's "blockReceived" notification reached the NameNode earlier than the client's pipeline update. Each under-construction block has two lists of locations: the first keeps track of the locations in a pipeline, and the second is a triplets structure for tracking finalized replica locations. On receiving blockReceived, the NameNode put the location into the block's triplets list. But when it later received the pipeline update, in order to handle the newer generation stamp, it removed the old under-construction block entity from the blocks map, constructed a new one, and added the new one back into the blocks map. However, the new block entity reset the second location list. That's why, when it was time to close the file, the NN checked the second list, complained that there was no replica, and refused to close the file. (An illustrative sketch of the two lists follows.)
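
        An illustrative, simplified sketch of the state described above (hypothetical field names, not the actual BlockInfoUnderConstruction members):

          import java.util.ArrayList;
          import java.util.List;

          // Simplified model of an under-construction block's two location lists.
          class UnderConstructionBlockSketch {
            long generationStamp;
            // First list: datanodes in the current write pipeline.
            List<String> pipelineLocations = new ArrayList<>();
            // Second list ("triplets"): datanodes that reported a finalized replica.
            List<String> finalizedLocations = new ArrayList<>();
          }
          // The bug: handling updatePipeline() by removing the old entity and
          // constructing a fresh one recreated finalizedLocations empty, discarding
          // the location recorded by the earlier blockReceived.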

        Hairong Kuang added a comment -

        Even with Konstantin's quick fix, TC7 still occasionally hangs because it is not able to close the file. There is yet another bug. The logs below helped me pinpoint the problem:

        [junit] 2009-10-14 23:15:07,359 DEBUG hdfs.StateChange (NameNode.java:blockReceived(948)) - BLOCK NameNode.blockReceived: from 127.0.0.1:57020 1 blocks.
        [junit] 2009-10-14 23:15:07,364 DEBUG hdfs.StateChange (FSNamesystem.java:blockReceived(2657)) - BLOCK* NameSystem.blockReceived: blk_-3647824922726121286_1008 is received from 127.0.0.1:57020
        [junit] 2009-10-14 23:15:07,364 DEBUG namenode.FSNamesystem (DatanodeDescriptor.java:processReportedBlock(466)) - Reported block blk_-3647824922726121286_1008 on 127.0.0.1:57020 size 65536 replicaState = FINALIZED
        [junit] 2009-10-14 23:15:07,364 DEBUG namenode.FSNamesystem (DatanodeDescriptor.java:processReportedBlock(479)) - In memory blockUCState = UNDER_CONSTRUCTION
        [junit] 2009-10-14 23:15:07,365 WARN namenode.FSNamesystem (BlockManager.java:addStoredBlock(1029)) - Inconsistent size for block blk_-3647824922726121286_1008 reported from 127.0.0.1:57020 current size is 32768 reported size is 65536
        [junit] 2009-10-14 23:15:07,365 INFO hdfs.StateChange (BlockManager.java:addStoredBlock(1098)) - BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:57020 is added to BlockInfoUnderConstruction{
        [junit]   blockUCState=UNDER_CONSTRUCTION
        [junit]   replicas=[blk_-3647824922726121286_1007, blk_-3647824922726121286_1007]
        [junit]   primaryNodeIndex=-1} size 65536
        [junit] 2009-10-14 23:15:07,365 INFO namenode.FSNamesystem (FSNamesystem.java:updatePipeline(3951)) - updatePipeline(block=blk_-3647824922726121286_1007, newGenerationStamp=1008, newLength=65536, newNodes=[127.0.0.1:57020], clientName=DFSClient_-1021096837)
        [junit] 2009-10-14 23:15:07,366 DEBUG hdfs.StateChange (FSDirectory.java:persistBlocks(321)) - DIR* FSDirectory.persistBlocks: /TC7/foo with 1 blocks is persisted to the file system
        [junit] 2009-10-14 23:15:07,369 INFO namenode.FSNamesystem (FSNamesystem.java:updatePipeline(4000)) - updatePipeline(blk_-3647824922726121286_1007) successfully to blk_-3647824922726121286_1008
        [junit] 2009-10-14 23:15:07,370 DEBUG hdfs.DFSClient (DFSClient.java:endBlock(2613)) - Closing old block blk_-3647824922726121286_1008
        ...
        [junit] 2009-10-14 23:15:12,317 DEBUG hdfs.StateChange (NameNode.java:complete(642)) - DIR NameNode.complete: /TC7/foo for DFSClient_-1021096837
        [junit] 2009-10-14 23:15:12,317 DEBUG hdfs.StateChange (FSNamesystem.java:completeFileInternal(1318)) - DIR* NameSystem.completeFile: /TC7/foo for DFSClient_-1021096837
        [junit] 2009-10-14 23:15:12,318 INFO hdfs.StateChange (FSNamesystem.java:checkFileProgress(1402)) - BLOCK* NameSystem.checkFileProgress: block BlockInfoUnderConstruction{
        [junit]   blockUCState=COMMITTED
        [junit]   replicas=[blk_-3647824922726121286_1008]
        [junit]   primaryNodeIndex=-1} has not reached minimal replication 1
        [junit] 2009-10-14 23:15:12,720 INFO hdfs.DFSClient (DFSClient.java:completeFile(3634)) - Could not complete file /TC7/foo retrying...

        Konstantin Shvachko added a comment -

        This was intended as a quick fix for resurrecting the build process, but TestFileAppend3.TC7() still fails anyway, though with different diagnostics.

        Konstantin Shvachko added a comment -

        In case of recovery from pipeline close failure the client

        1. requests a new GS from NN via NameNode.updateBlockForPipeline();
        2. sends the new GS to the remaining DNs via DataStreamer.createBlockOutputStream();
        3. notifies NN of establishing the new pipeline, which updates the block's GS to the new one via NameNode.updatePipeline()

        During (2), DNs may send addBlock() to NN, which may cause a race condition with notification (3) from the client.
        You are right that one solution is to ignore the GS when looking up the replica in addBlock(). The best way to fix it that way is to implement HDFS-512.

        Another solution would be to set the new GS on the block in (1). That is, NameNode.updateBlockForPipeline() would have to not only return the new GS, but also update the under-construction block with this GS. I checked the code and do not see problems with this approach so far. (A rough sketch below.)
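
        A rough, hypothetical sketch of this second approach, reusing the UnderConstructionBlockSketch type illustrated in a comment above (method and parameter names are illustrative; the real logic would live on the NameNode):

          class PipelineRecoverySketch {
            // In step (1), stamp the freshly allocated GS onto the stored block,
            // so a datanode report racing ahead of updatePipeline() still
            // matches the NameNode's state.
            static long updateBlockForPipelineSketch(UnderConstructionBlockSketch storedBlock,
                                                     long newGS) {
              storedBlock.generationStamp = newGS;  // update NN-side state up front
              return newGS;                         // handed back to the client as before
            }
          }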

        Hairong Kuang added a comment -

        In the above case, a pipeline close recovery was performed on a block that was opened for append. The datanode bumped its replica's generation stamp (GS) and finalized the replica. It then notified the NN of the new replica. However, the datanode's notification reached the NN before the client notified the NN of the replica's new GS and length. The NN mistakenly treated this good replica as a bad one, a replica that does not belong to any file. Thus the file could not be closed.

        The solution is not to match the GS when searching for the stored block info while adding a finalized replica (a rough sketch below).
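
        A rough sketch of the idea (hypothetical map and helper, reusing the UnderConstructionBlockSketch type above; the actual lookup lives in BlocksMap):

          import java.util.HashMap;
          import java.util.Map;

          class GsAgnosticLookupSketch {
            // Key the lookup on the block ID alone, so a FINALIZED replica
            // reported with a bumped GS still finds the stored block whose GS
            // has not yet been updated by the client's updatePipeline().
            private final Map<Long, UnderConstructionBlockSketch> blocksById =
                new HashMap<>();

            UnderConstructionBlockSketch getStoredBlock(long blockId) {
              return blocksById.get(blockId);  // generation stamp is ignored
            }
          }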


          People

          • Assignee:
            Hairong Kuang
          • Reporter:
            Hairong Kuang
          • Votes:
            0
          • Watchers:
            3

          Dates

          • Created:
          • Updated:
          • Resolved: