Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: HA branch (HDFS-1623)
    • Fix Version/s: HA branch (HDFS-1623)
    • Component/s: ha
    • Labels:
      None

      Description

      Without the patch in HDFS-1108, new block allocations aren't logged to the edits log. For HA, we'll need that functionality and we'll need to make sure that block locations aren't blown away in the Standby NN when tailing the edits log.

      As described in HDFS-1975:

      When we close a file, or add another block to a file, we write OP_CLOSE or OP_ADD in the txn log. FSEditLogLoader, when it sees these types of transactions, creates new BlockInfo objects for all of the blocks listed in the transaction. These new BlockInfos have no block locations associated. So, when we close a file, the SBNN loses its block locations info for that file and is no longer "hot".

      I have an ugly hack which copies over the old BlockInfos from the existing INode, but I'm not convinced it's the right way. It might be cleaner to add new opcode types like OP_ADD_ADDITIONAL_BLOCK, and actually treat OP_CLOSE as just a finalization of INodeFileUnderConstruction to INodeFile, rather than replacing block info at all.
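
       For illustration, a minimal sketch of the "copy over the old BlockInfos" idea, using simplified stand-in types (names are illustrative, not the real INode/BlockInfo classes or the actual hack):

       import java.util.Map;

       class Block {
         final long blockId;
         Block(long blockId) { this.blockId = blockId; }
       }

       class BlockInfo extends Block {
         final String[] locations; // datanode locations known to the NN
         BlockInfo(long blockId, String[] locations) {
           super(blockId);
           this.locations = locations;
         }
       }

       class EditReplaySketch {
         // When replaying OP_ADD/OP_CLOSE, prefer the BlockInfo already held in
         // memory (which carries locations) over a freshly constructed one
         // (which does not), so the standby stays "hot".
         static BlockInfo[] mergeBlocks(Map<Long, BlockInfo> existing, Block[] fromEdit) {
           BlockInfo[] merged = new BlockInfo[fromEdit.length];
           for (int i = 0; i < fromEdit.length; i++) {
             BlockInfo old = existing.get(fromEdit[i].blockId);
             merged[i] = (old != null) ? old : new BlockInfo(fromEdit[i].blockId, new String[0]);
           }
           return merged;
         }
       }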

      1. HDFS-2602.patch
        46 kB
        Aaron T. Myers
      2. HDFS-2602.patch
        46 kB
        Aaron T. Myers
      3. HDFS-2602.patch
        38 kB
        Aaron T. Myers
      4. HDFS-2602.patch
        38 kB
        Aaron T. Myers
      5. HDFS-2602.patch
        47 kB
        Aaron T. Myers


          Activity

          atm Aaron T. Myers added a comment -

          Here's a patch which includes all of the changes from the most recent patch posted on HDFS-1108. Additionally, this patch modifies FSEditLogLoader to do the following:

          • Separate the handling of OP_ADD and OP_CLOSE.
          • More clearly separate the file create vs. block add parts of OP_ADD.
           • Add a new method, updateBlocks(), which updates the NN's in-memory data structures on OP_CLOSE, or on OP_ADD when the op adds a block.

           This patch also cleans up some of the file creation code; in particular, OP_ADD is no longer used to create directories, OP_MKDIR is used instead.

          With this patch, TestStandbyIsHot passes.

           It wouldn't be difficult to separate the two cases of OP_ADD (file create and block addition) into separate opcodes, along the lines of OP_CREATE and OP_ADD_BLOCK, but there seems to be little benefit in doing so beyond a slight improvement in code clarity. If people feel strongly, I'd be happy to implement that.
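
           As a rough illustration of the updateBlocks() idea (simplified stand-in types, not the code from the attached patch):

           import java.util.List;

           class UpdateBlocksSketch {
             static class Block {
               long id;
               long numBytes;
               long genStamp;
             }

             // Apply the block list carried by an OP_ADD/OP_CLOSE to the blocks
             // the NN already tracks for the file: update existing entries in
             // place (so their location info survives) and append any newly
             // allocated block at the end.
             static void updateBlocks(List<Block> fileBlocks, List<Block> opBlocks) {
               for (int i = 0; i < opBlocks.size(); i++) {
                 Block fromOp = opBlocks.get(i);
                 if (i < fileBlocks.size()) {
                   Block existing = fileBlocks.get(i);
                   existing.numBytes = fromOp.numBytes; // length may have advanced
                   existing.genStamp = fromOp.genStamp; // genstamp may have been bumped
                 } else {
                   fileBlocks.add(fromOp); // a block added by this op
                 }
               }
             }
           }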

          atm Aaron T. Myers added a comment -

           I should have mentioned - I've run TestStandbyIsHot, TestPersistBlocks, TestAbandonBlock, and TestClientProtocolForPipelineRecovery with this patch, and they all pass. I intend to run the full test suite on my box over the weekend.

          tlipcon Todd Lipcon added a comment -

           It looks like some other stuff slipped into this patch file – e.g. the changes to TestStandbyIsHot to lower the replication count and make sure it replicates correctly, and some changes to the DN and NN to make this new test pass. Let's leave that for a later issue and make this one only address the problem as described in the issue Description.

          atm Aaron T. Myers added a comment -

          Thanks a lot for the review, Todd. Here's an updated patch which removes those unrelated changes which snuck in there.

          tlipcon Todd Lipcon added a comment -
          +  public static boolean sharedEditsDir(Configuration conf) {
          +    return null != conf.get(DFS_NAMENODE_SHARED_EDITS_DIR_KEY);
          

           I think this would be better named usesSharedEditsDir or isSharedEditsDirConfigured.
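
           For example, something along the lines of (same body as the snippet above, only the name changes):

           public static boolean isSharedEditsDirConfigured(Configuration conf) {
             return conf.get(DFS_NAMENODE_SHARED_EDITS_DIR_KEY) != null;
           }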


                 for(Iterator<DatanodeDescriptor> it = blocksMap.nodeIterator(blk);
                     it.hasNext();) {
                   final DatanodeDescriptor d = it.next();
          +        assert d != null;
          

          I don't think this new assert is very useful. But let's keep the new one just below it.


          In getInodeFile:

           • better to capitalize this as {{getINodeFile}}, I think
          • no need for the local variable file here

          In updateBlocks:

          +    INodeFile file = (INodeFile)getInodeFile(fsDir, addCloseOp.path);
          

           Unnecessary cast.

          +      if (oldBlock instanceof BlockInfoUnderConstruction) {
          +        fsNamesys.getBlockManager().forceCompleteBlock(
          +            (INodeFileUnderConstruction)file,
          +            (BlockInfoUnderConstruction)oldBlock);
          +      }
          

           What's the logic behind this? I think there are some cases in which there would be multiple OP_ADDs for the same file with the same blocks. For example, the client-side hflush() call can trigger persistBlocks, but in the case of HA, the last block would already have been persisted by getAdditionalBlock as well. So the last block would be prematurely marked as COMPLETE in this case.
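
           For illustration, one possible shape of a guard that avoids completing the op's last block, reusing the names from the quoted diff (the block-ID comparison is illustrative, not a concrete proposal):

           // Skip the op's last block: it may legitimately still be under
           // construction and will be completed by a later OP_ADD or the OP_CLOSE.
           boolean isLastBlockOfOp = oldBlock.getBlockId()
               == addCloseOp.blocks[addCloseOp.blocks.length - 1].getBlockId();
           if (oldBlock instanceof BlockInfoUnderConstruction && !isLastBlockOfOp) {
             fsNamesys.getBlockManager().forceCompleteBlock(
                 (INodeFileUnderConstruction) file,
                 (BlockInfoUnderConstruction) oldBlock);
           }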

          +    } else if (addCloseOp.blocks.length > oldBlocks.length) {
          +      // We're adding a block
          +      assert addCloseOp.blocks.length - 1 == oldBlocks.length;
          

          Not sure this is quite right either - in the non-HA case, we could have the following sequence of events:

          • Client opens file (causes OP_ADD)
          • Client writes one block
          • Client writes half of second block
          • Client calls hflush() which triggers persistBlocks
            This will cause an OP_ADD that adds two blocks rather than just one.
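
           So instead of asserting exactly one extra block, the replay code would need to append however many trailing blocks the op carries, roughly (addNewBlock is a hypothetical helper, not from the patch):

           // Append every block beyond the ones the NN already tracks, rather
           // than assuming exactly one new block per OP_ADD.
           for (int i = oldBlocks.length; i < addCloseOp.blocks.length; i++) {
             addNewBlock(file, addCloseOp.blocks[i]); // hypothetical helper
           }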

          • Still has the new triggerDeletionReport code in DatanodeAdapter
           • Can you locally change the default for persistBlocks to true and run all of the unit tests? In particular, I want to make sure that stuff like lease recovery continues to work properly.
          • Did you run all the unit tests locally?
          atm Aaron T. Myers added a comment -

          Thanks again for the review, Eli. I agree that that synchronization is unnecessary. Here's an updated patch which removes that sync.

          I'm going to commit this momentarily unless there are further objections.

          atm Aaron T. Myers added a comment -

          Whoops! Meant to attach that patch and make that comment on HADOOP-7896. Please disregard the previous comment.

          atm Aaron T. Myers added a comment -

          Thanks a lot for the review, Todd. Here's an updated patch which should address all of your comments.

          shv Konstantin Shvachko added a comment -

           I was observing the same issue, and came to the conclusion that it should be solved the same way, with convertToInodeFile().
           I also see that my comments in HDFS-1108 have been incorporated. So conceptually it all looks good.
           I just don't understand why this block persisting issue keeps migrating from HDFS-978 to HDFS-1108 and now here. Are you seriously measuring performance in # of JIRAs (or lines of code)?

          tlipcon Todd Lipcon added a comment -

           I just don't understand why this block persisting issue keeps migrating from HDFS-978 to HDFS-1108 and now here. Are you seriously measuring performance in # of JIRAs (or lines of code)?

          Funny

          Dhruba opened both HDFS-978 and HDFS-1108, not sure why. It ended up migrating here because at first we considered it separate, and were building this patch on top of HDFS-1108. Then by the time the patch was done it seemed to make more sense to integrate them. Would be reasonable to commit them separately, though - 1108 for the logging and this one for the FSEditLog changes to avoid losing the block locations. Would you prefer that or can we just commit this as is once it's reviewed?

          atm Aaron T. Myers added a comment -

          I also see that my comments in HDFS-1108 have been incorporated. So conceptually it all looks good.

          This patch literally includes the latest patch you posted on HDFS-1108. I just didn't want to wait for that to get committed before making progress on this JIRA.

          As Todd said, it wouldn't be too difficult to separate them, but they seem pretty naturally one unit. I'd personally prefer to just commit this as-is, but if you feel strongly, Konst, I can separate them.

          shv Konstantin Shvachko added a comment -

           The issues are tightly related, and since it is already reviewed, let's commit it here. Maybe change the title to reflect its more general purpose.

          atm Aaron T. Myers added a comment -

          Thanks a lot, Konstantin. I've updated the summary and description to more accurately reflect the changes being made in this JIRA.

          tlipcon Todd Lipcon added a comment -
           • Can updateBlocks take the INodeFile object itself, rather than the path? In both cases we already have the INodeFile object, so the lookup is redundant.
           • In updateBlocks, in the case that the last block in the file is updating its generation stamp, don't you need to call oldBlock.setGenerationStamp(newBlock.getGenerationStamp())? Does append work correctly across a restart? (See the sketch below.)
           • Style nit: i++ instead of i ++ in the "adding blocks" case.
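
           For the generation stamp point, a rough sketch (names borrowed from the snippets above; getLastBlock() is an assumed accessor, not necessarily the real one):

           // If the op's last block is the same block with a newer generation
           // stamp (e.g. after append or pipeline recovery), update it in place
           // so the existing BlockInfo (and its locations) is preserved.
           Block newBlock = addCloseOp.blocks[addCloseOp.blocks.length - 1];
           BlockInfo oldBlock = file.getLastBlock(); // assumed accessor
           if (oldBlock.getBlockId() == newBlock.getBlockId()
               && oldBlock.getGenerationStamp() != newBlock.getGenerationStamp()) {
             oldBlock.setGenerationStamp(newBlock.getGenerationStamp());
           }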
          tlipcon Todd Lipcon added a comment -

           It looks like TestLeaseRecovery2's restart-related tests are timing out/failing with this patch - perhaps due to the genstamp issue mentioned above?

          tlipcon Todd Lipcon added a comment -

          Also TestEditLog fails since it logs alternating OP_ADD and OP_CLOSE for the same file. I don't know if it's an unrealistic test or an actual issue – but I think what's happening is this:

          • OP_ADD creates a new INodeFileUnderConstruction
           • OP_CLOSE converts it to an INodeFile
          • OP_ADD sees an already-existing file, and just does updateBlocks without converting back to INodeFileUnderConstruction
          • OP_CLOSE fails because it's trying to close a non-underconstruction file

          Doesn't this happen in the append() case?
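
           If it is a real issue, the fix would presumably look something like the following when replaying OP_ADD for an existing file (hypothetical helper names, a sketch rather than a concrete proposal):

           // When OP_ADD targets a file that is already finalized, convert it
           // back to under-construction, as append would, so that the following
           // OP_CLOSE can finalize it again.
           if (file != null && !file.isUnderConstruction()) {
             file = fsDir.convertToUnderConstruction(file); // hypothetical helper
           }
           updateBlocks(file, addCloseOp);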

          atm Aaron T. Myers added a comment -

          Thanks a ton for the thorough review, Todd. Here's a patch which should address all of your comments.

           I had to amend TestEditLog slightly. The scenario it was trying to exercise is legitimate (repeatedly opening and closing the same file), but it had a bug which could result in multiple consecutive opens or closes being logged. This patch fixes that issue as well.
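
           The shape of the fixed test loop is roughly (a sketch with hypothetical method names, not the amended test itself):

           // Strictly alternate: each iteration logs exactly one open followed by
           // one close for the same path, so the generated edit log never
           // contains two consecutive OP_ADDs or OP_CLOSEs for a file.
           for (int i = 0; i < numTransactions; i++) {
             String path = "/test-file-" + threadId + "-" + i;
             logOpenFile(path);  // OP_ADD
             logCloseFile(path); // OP_CLOSE
           }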

          tlipcon Todd Lipcon added a comment -

           Looks good... one small request: can you change the assertion in the "abandon block" case in updateBlocks to a proper if statement that throws an IOE? I think in this section of the code we'd rather be safe and throw an exception than risk replaying the edits incorrectly.
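
           i.e., roughly the following shape (the condition name and message are illustrative):

           // Before: only checked when assertions are enabled (-ea).
           assert blocksMatchExpectation : "unexpected block state in abandon-block case";

           // After: always enforced, so a bad edit aborts replay instead of
           // silently applying incorrect block state.
           if (!blocksMatchExpectation) {
             throw new IOException("Unexpected block state while replaying edits for "
                 + addCloseOp.path);
           }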

          atm Aaron T. Myers added a comment -

          Here's a patch which converts those asserts into proper exceptions with error messages.

          tlipcon Todd Lipcon added a comment -

          +1, thanks for wrestling this one to the ground.

          atm Aaron T. Myers added a comment -

          Thanks a lot for your many reviews, Todd. I've just committed this to the HA branch.

          hudson Hudson added a comment -

          Integrated in Hadoop-Hdfs-HAbranch-build #18 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/18/)
          HDFS-2602. NN should log newly-allocated blocks without losing BlockInfo. Contributed by Aaron T. Myers

          atm : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1215036
          Files :

          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-1623.txt
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/HAUtil.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirectory.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogLoader.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/LeaseManager.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/PendingDataNodeMessages.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/EditLogTailer.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestPersistBlocks.java
          • /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java

            People

             • Assignee: atm Aaron T. Myers
             • Reporter: tlipcon Todd Lipcon
             • Votes: 0
             • Watchers: 6
