Details

    • Target Version/s:

      Description

      The current HDFS design says that newly allocated blocks for a file are not persisted in the NN transaction log when the block is allocated. Instead, an hflush() or a close() on the file persists the blocks into the transaction log. It would be nice if we could immediately persist newly allocated blocks (as soon as they are allocated) for specific files.
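
      To make the proposal concrete, here is a minimal sketch of the allocate-then-log-then-optionally-sync pattern in plain illustrative Java; EditLog, BlockAllocatorSketch, and the persistBlocks flag (modeled on the dfs.persist.blocks setting debated in the comments) are stand-ins, not the actual FSNamesystem/FSEditLog code.

      interface EditLog {
        void logAddBlock(String src);   // record the file's new block list (an OP_ADD style op)
        void logSync();                 // force the buffered edits to durable storage
      }

      class BlockAllocatorSketch {
        private final EditLog editLog;
        private final boolean persistBlocks;  // e.g. a "dfs.persist.blocks" style setting

        BlockAllocatorSketch(EditLog editLog, boolean persistBlocks) {
          this.editLog = editLog;
          this.persistBlocks = persistBlocks;
        }

        long addBlock(String src) {
          long newBlockId = allocate(src);  // update the in-memory namespace state
          editLog.logAddBlock(src);         // log the allocation as soon as it happens
          if (persistBlocks) {
            editLog.logSync();              // make the allocation durable immediately
          }
          return newBlockId;
        }

        private long allocate(String src) {
          // block placement and bookkeeping are elided in this sketch
          return System.nanoTime();
        }
      }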

      1. HDFS-1108.patch
        6 kB
        Dmytro Molkov
      2. hdfs-1108-habranch.txt
        9 kB
        Todd Lipcon
      3. hdfs-1108.txt
        10 kB
        Todd Lipcon
      4. hdfs-1108-habranch.txt
        10 kB
        Eli Collins
      5. hdfs-1108-habranch.txt
        10 kB
        Todd Lipcon
      6. hdfs-1108-habranch.txt
        11 kB
        Aaron T. Myers
      7. hdfs-1108-habranch.txt
        11 kB
        Konstantin Shvachko
      8. hdfs-1108-hadoop-1.patch
        12 kB
        Sanjay Radia
      9. hdfs-1108-hadoop-1-v2.patch
        16 kB
        Sanjay Radia
      10. hdfs-1108-hadoop-1-v3.patch
        17 kB
        Sanjay Radia
      11. hdfs-1108-hadoop-1-v4.patch
        17 kB
        Sanjay Radia
      12. hdfs-1108-hadoop-1-v5.patch
        17 kB
        Sanjay Radia

        Issue Links

          Activity

          Matt Foley added a comment -

          Closed upon release of 1.1.1.

          Matt Foley added a comment -

          Merged to branch-1.1.

          Matt Foley added a comment -

          Incorrectly resolved as duplicate. The OTHER bug (HDFS-978) was the one closed. The fix was committed under this Jira ID.

          Sanjay Radia added a comment -

          This is useful for restarting a failed NN - with manual or automatic restart.
          Hence yes, it is useful for HA in Hadoop 1.

          Eli Collins added a comment -

          Hey Sanjay,
          What's the use case for this in Hadoop 1.x, some form of v1 HA failover?

          Sanjay Radia added a comment -

          Committed to branch 1

          Sanjay Radia added a comment -

          Updated patch for Hadoop 1
          Modified TestPersistBlocks to use both flush and sync as per comments.

          Sanjay Radia added a comment -

          @Todd, I was thinking of exactly the same - cover both flush and hflush. Will post an updated patch tomorrow. Thanks

          Tsz Wo Nicholas Sze added a comment -

          > Nicholas, if you look at the trunk patches, you will see the same.

          I see. +1 on the patch.

          Todd Lipcon added a comment -

          Hey Sanjay. I looked at the flush() vs hflush() change that you mentioned. It looked like it changed between attachments #12507073 and #12507492 on HDFS-2602, so I talked to Aaron about it. We determined that he probably changed it due to my review comment:

          What's the logic behind this? I think there are some cases in which there would be multiple OP_ADDs for the same file with the same blocks. For example, the client side hflush() call can trigger persistBlocks, but in the case of HA, the last block would already have been persisted by getAdditionalBlock as well. So the last block would be prematurely marked as COMPLETE in this case.

          So, I agree with you that we should change the hflush() back to flush(), but we should also add another test case which does use hflush, so that we test the case where the blocks are auto-logged due to the PERSIST_BLOCKS=true setting, and then logged again due to hflush(). Perhaps we can add a flag to the test like boolean useHflush and run it with both false and true. How does that sound?
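
          In sketch form, the test shape could look like this (a hedged sketch only; the class name, method names, and the doTestRestart helper are illustrative, not the actual TestPersistBlocks code):

          public class PersistBlocksRestartTestSketch {
            public void testRestartWithFlush() throws Exception {
              doTestRestart(false);  // blocks persisted only via the persist-blocks setting
            }

            public void testRestartWithHflush() throws Exception {
              doTestRestart(true);   // blocks also logged again via the NN fsync that hflush triggers
            }

            private void doTestRestart(boolean useHflush) throws Exception {
              // 1. write DATA_BEFORE_RESTART, then call flush() or hflush() depending on useHflush
              // 2. restart the NameNode while the stream is still open
              // 3. write DATA_AFTER_RESTART, close, and verify that all of the data is readable
            }
          }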

          Sanjay Radia added a comment -

          Updated patch - testEarlierVersionEditLog backported.

          Sanjay Radia added a comment -

          Nicholas, if you look at the trunk patches, you will see the same.

          Tsz Wo Nicholas Sze added a comment -

          Hi Sanjay,

          +    dir.persistBlocks(src, file);
          +    if (persistBlocks) {
          +      getEditLog().logSync();
          +    }
          

          Why is dir.persistBlocks(src, file) not inside the then-clause of the if-statement?

          Sanjay Radia added a comment -

          Port of patch for Hadoop-1.

          Note TestPersistBlocks() has 2 bugs in trunk.
          The original patch (hdfs-1108.txt) called stream.flush() instead of stream.hflush(). flush() was correct, since hflush() already forces the block allocation to be sync'ed to the log.

          The patch for the ha-branch (hdfs-1108-habranch.txt) changed this as it introduced additional tests. Hence the test is not really testing the problem that this jira fixes.

          In my port to hadoop-1, I ran TestPersistBlocks without the fix to FSNamesystem and tried both sync() (i.e. hflush) and flush(). The tests passed with sync() but not with flush().
          Note that both testRestartDfs and testRestartWithPartialBlockHflushed should
          call flush() and not sync() after writing DATA_BEFORE_RESTART.

          I will file a jira to fix this in trunk.

          Todd Lipcon added a comment -

          Resolving this one as duplicate since it got incorporated into HDFS-2602

          Konstantin Shvachko added a comment -

          Why is it Patch Available if it is not for trunk?

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12506412/hdfs-1108-habranch.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/1654//console

          This message is automatically generated.

          Konstantin Shvachko added a comment -

          Two suggestions.

          1. dfs.persist.blocks should control whether the block(s) are logSync()'ed, while persistBlocks() should be called unconditionally.
          2. When a file is closing, the blocks should be logSync()'ed unconditionally. This is the current behavior, which should be retained.

          For the BackupNode approach it is important that blocks are persisted, that is, sent to the backup stream. The backup stream can then take care of delivering the transaction to the BackupNode. If persistBlocks() is not called, the BackupNode will not have any way to know about the blocks being created.
          logSync() is not required for the BackupNode, but is required for your shared storage solution.

          I think the patch should work for both approaches.
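
          In sketch form (illustrative Java only; Journal and NamespaceSketch are stand-ins for the real edit-log classes, and the syncOnAddBlock flag stands in for dfs.persist.blocks), the suggestion is:

          interface Journal {
            void logBlocks(String src);     // an OP_ADD style op with the current block list
            void logCloseFile(String src);  // an OP_CLOSE style op
            void logSync();
          }

          class NamespaceSketch {
            private final Journal journal;
            private final boolean syncOnAddBlock;  // the dfs.persist.blocks knob

            NamespaceSketch(Journal journal, boolean syncOnAddBlock) {
              this.journal = journal;
              this.syncOnAddBlock = syncOnAddBlock;
            }

            void addBlock(String src) {
              journal.logBlocks(src);   // unconditional: a BackupNode stream always sees the op
              if (syncOnAddBlock) {
                journal.logSync();      // conditional: durability for the shared-storage case
              }
            }

            void closeFile(String src) {
              journal.logCloseFile(src);
              journal.logSync();        // unconditional: retains the current close() behavior
            }
          }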

          Eli Collins added a comment -

          Per the above, we can remove the conf option since we only need to do this if append or a shared edits dir is enabled, right? We can modify the test to only run in that case (or perhaps check the inverse behavior if appends or shared edits are not enabled).

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12505835/hdfs-1108-habranch.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/1627//console

          This message is automatically generated.

          Aaron T. Myers added a comment -

          Here's an updated patch which adds back the dfs.persist.blocks config and changes the test to use it. Persistence of blocks is also enabled automatically in the case where HA is enabled with a shared edits dir, regardless of the value of dfs.persist.blocks.
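
          Roughly, the effective flag is derived like this (a sketch only; the haEnabled and hasSharedEditsDir parameters stand in for the real HAUtil/configuration lookups and are not their actual signatures):

          class PersistBlocksPolicySketch {
            // HA with a shared edits dir always persists block allocations;
            // otherwise the dfs.persist.blocks setting decides.
            static boolean shouldPersistBlocks(boolean confPersistBlocks,
                                               boolean haEnabled,
                                               boolean hasSharedEditsDir) {
              return (haEnabled && hasSharedEditsDir) || confPersistBlocks;
            }
          }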

          Todd Lipcon added a comment -

          oh, duh... you're right...

          I think we should add back the configuration which allows it to persist blocks even if HA is not enabled. As the test demonstrates, it's necessary for "fast restart" as a form of HA for tiny clusters. It will also allow this test to work without setting up the entire HA infrastructure.

          Eli Collins added a comment -

          Doesn't the test need to enable HA and shared edits dir or is it only covering the syncs triggered by supportAppends?

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12505289/hdfs-1108-habranch.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/1610//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          Updated patch on top of HDFS-2582 - same patch except for a trivial change: we need to pass the namespace id to HAUtil.isHaEnabled now. After HDFS-2582 goes in, I'll commit this based on the earlier +1s from folks.

          Konstantin Shvachko added a comment -

          I remove my veto. Thanks for addressing my concerns.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12501211/hdfs-1108-habranch.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/1483//console

          This message is automatically generated.

          Eli Collins added a comment -

          Updated patch attached for the HDFS-1623 branch. It removes the new configuration option: block persistence is now triggered only by append and by shared-edits-dir based HA (using the config flag introduced in HDFS-1975). In shared-edits-dir based HA the log and sync are needed to ensure the standby state is consistent in the case of fail-over.

          This does not preclude adding persistent, guaranteed streams in the future, and introduces no additional overhead for the BN-based approach. Similarly, we can unconditionally log blocks in the future if benchmarking doesn't show significant impact.

          Reasonable?

          Eli Collins added a comment -

          The design doc allows for and discusses multiple approaches; specific implementations are happening in the jiras, e.g. see HDFS-1975 for sharing NN state.

          Konstantin Shvachko added a comment -

          Eli, the latest design doc does not make any decisions on the matter; your excerpt is from the oldest doc.
          Since the discussion is turning to a more general topic, I posted my concerns in the main jira.

          Eli Collins added a comment -

          The first "we" refers to the authors of the 1623 design doc, the 2nd "we" refers to the people working on 1623 sub-tasks.

          Konstantin Shvachko added a comment -

          Eli, "we" is ambiguous, could you please clarify.

          Eli Collins added a comment -

          Konstantin, please check out the latest design doc in HDFS-1623; the relevant section:

          We see the benefits of streaming edits to the Standby (such an implementation exists in HDFS 21). However we believe that it appears to be simpler to initially use shared storage between the primary and secondary.

          This section goes on to outline the rationale. Please also see HDFS-1975, which states "the proposed solution in this jira is to use a shared storage to store the namenode state" in its description.

          In short, it should be no surprise that we're going for the shared storage approach initially; however, the work is not being done in a way that precludes alternative implementations.

          Todd Lipcon added a comment -

          In the hflush case, the client calls namenode.fsync() the first time hflush() is called on a new block. This causes the blocks to get persisted, and does call logSync() already today.

          Suresh Srinivas added a comment -

          Suresh, based on your comments, you should vote -1 on this patch, because this patch calls persistBlocks only when supportAppends or hasHA. So, why not enable it every time, unless a benchmark shows serious regression?

          I do not -1 when the discussion is still in progress. I agree we should understand the cost of persisting irrespective of HA or not. I also think that eventually (once 0.23 is stable enough) supportAppends could be set to true by default, and the optimization may not be effective in the default case.

          Suresh, yes you do repeat it. But you never answered MY question, which HA approach are you implementing. As you see you have to make choices even with issues that seemed to be common part for all approaches.

          I thought it was clear. I would like to implement HA with the shared storage approach, not one dependent on IP failover.

          I like Milind's idea about an implementation "without shared storage assumption".

          If that is the case, then let's also remove the BackupNode from the picture, to remove the argument that the editlog is persisted in the BackupNode's memory.

          Leaving HA out of the question, isn't not persisting block allocations an issue even with the new append, in the following scenario:

          1. A block is allocated on an NN
          2. The client starts writing to the block and performs a flush.
          3. The NN restarts at this time and has no idea about this new block. During lease recovery, it closes the file (with no UnderConstruction block).

          Will the above scenario result in loss of data?

          Konstantin Shvachko added a comment -

          I and several others are working on approach #2

          Thanks Todd for clarifying this. Could you please also share the design document for your approach? I would like to learn the details and understand why you chose an approach which at this point does not seem to me optimal for the project.

          option 1 causes dataloss regardless of your opinions on HA

          This is not a data loss, Todd. This is a tradeoff between performance and the persistence of data. Having both flush and sync, one can control when to choose performance in favor of guaranteed persistence and when vice versa, regardless of my opinion on HA.
          Is anybody asking to remove this flexibility?

          How do you differentiate between logSync() to that stream Vs stream to the disk?

          I do not differentiate between streams. As I said, the addBlock() transaction should be treated the same way as setTimes(): it is logged (and batched) but not synced. There is no consistency issue here. Transactions will eventually be committed to the journal by another sync-able transaction or by a file close().

          The approach 2 has different requirements that many are interested in. I have repeated this many times.

          Suresh, yes you do repeat it. But you never answered MY question, which HA approach are you implementing. As you see you have to make choices even with issues that seemed to be common part for all approaches.

          I like Milind's idea about an implementation "without shared storage assumption".

          Sticking to the point.

          • Without HA considerations, this patch removes the flexibility to choose between performance and guaranteed persistence of data
          • There should be a good reason for that
          • The HA solution with shared storage seems to be the reason
          • The community has not seen the design, hasn't discussed it, and is not aware of why this one is better than the other three (or was it four) published in different jiras.
          Milind Bhandarkar added a comment -

          Sorry to come to this party late.

          Stepping back a little bit: creation of a block has to be a two-phase transaction - show the intent to create a block (and log it), write the data (and note that it is being written), and close the block (commit it) - in order to recover from failures (whether in a single-NN fail-and-manual-restart mode, fail-and-automatic-restart mode, or fail-and-backup-restart mode). So, no one is contesting that the intent to create a block and the commit of the block both need to be logged. (The data being written need not be logged, because it can be merged with the intent to create the block without any availability degradation.) Now, the only issue that remains to be discussed is whether the code assumes that the logs are written to shared storage specifically, or whether it generalizes and writes to a stream that is guaranteed to be persisted by the stream broker.

          I think the latter is where everyone in the community would like to be. All the patch contributor needs to do is make sure that this eventual goal is not blocked by the current changes.

          Todd, do you think one could enable persistent, guaranteed streams, without the shared storage assumption, with this patch?

          Milind Bhandarkar added a comment -

          Suresh, based on your comments, you should vote -1 on this patch, because this patch calls persistBlocks only when supportAppends or hasHA. So, why not enable it every time, unless a benchmark shows serious regression?

          Suresh Srinivas added a comment -

          Data loss on failover is not acceptable. I thought that all transactions will be passed to the standby node and processed in RAM there, but not synced to disk. On failover you will have completely up to date namespace in SBN. No need to sync.

          If the BackupNode is the only node that has the more current editlog, that means any restart that does not involve the BackupNode could cause a consistency issue. Isn't the BackupNode stream treated as an editlog? How do you differentiate between logSync() to that stream Vs stream to the disk?

          My vote is to call logSync() after block allocation. While this might hold a thread a little longer, I do not see an issue, unless a benchmark shows serious regression. Let's not optimize prematurely if this is not an issue.

          As I understood Suresh is trying to build a universal HA framework applicable to both.

          Again I have explained this in several jiras and in our conversations. It is not "Universal HA framework". The approach 2 has different requirements that many are interested in. I have repeated this many times.

          Todd Lipcon added a comment -

          Hi Konstantin. I and several others are working on approach #2: one in which the edit logs are accessible by both the active and standby NameNodes, initially based on a NAS device, but potentially also supporting other shared storage such as BookKeeper.

          My understanding of the HA framework is that we are aiming to build it in such a way that neither approach is prevented - but obviously not all parts of one approach are necessary for the other approach. In this case, this patch is useful for proposal #2 but does not prevent approach #1. Given the APIs available in today's BackupNode, this patch is also necessary for approach #1.

          What you are describing is a potential optimization available in approach #1 that is not available in approach #2 - and like you said, it's a nice advantage. But without this patch, neither approach can be correct.

          To put it another way, there are three options:
          1) leave the code as it is now. That causes a potential dataloss in both approaches.
          2) commit this patch. This fixes a potential dataloss in both approaches but has a small performance impact.
          3) commit a better version of this patch which also fixes dataloss but also provides a more optimized code path for the BackupNode.

          Given that the code does not exist for option 3 above, and option 1 causes dataloss regardless of your opinions on HA, we should go with option 2 above.

          Keep in mind that, even outside the scope of HA, this fixes a potential bug with NameNode restart. If you have a small cluster that can restart quickly, this patch might allow the NN to restart within the retry window of a client, allowing clients to more easily ride over such restarts. Without the patch, any files in progress will be corrupted.

          Konstantin Shvachko added a comment -

          Yes, two proposals. So let me ask you the same question I've been asking Suresh. Which proposal are you implementing?
          1) RPC based (BackupNode)
          2) shared storage based (AvatarNode-like design)

          As I understood Suresh is trying to build a universal HA framework applicable to both. Your patch is intended for proposal 2 only and therefore contradicts his effort, if I am not missing anything.
          Is there a community decision on where we are going? Do we need to have a vote on this?

          I consider the performance overhead of syncing on each addBlock() a disadvantage of the shared storage based approach.

          I think this issue should be blocked until the HA approach dilemma is resolved, as it attempts to implicitly move development in one particular direction.

          Todd Lipcon added a comment -

          Currently there are two proposals for the implementation of passing edits to the standby:
          1) RPC based (BackupNode)
          2) shared storage based (AvatarNode-like design)

          In the RPC-based mechanism, it makes sense to support a "semi-durable flush" - meaning that it has been ACKed to RAM on the standby. In the shared-storage model, there is no way to ensure that data has been made durable aside from issuing a full fsync.

          In essence what you're proposing is very much like the difference between "hflush" and "hsync" in the DFS write pipeline – ie a "flush to another node's memory" API distinct from "flush to disk".

          Given that this API is not currently implemented by the BackupNode, and the shared-storage approach requires an fsync, I don't think we should block this fix. Do you plan to implement this new BackupNode feature for trunk in the near future so that we can unblock this work?

          Konstantin Shvachko added a comment -

          Data loss on failover is not acceptable. I thought that all transactions will be passed to the standby node and processed in RAM there, but not synced to disk. On failover you will have completely up to date namespace in SBN. No need to sync.

          Todd Lipcon added a comment -

          This should be similar to how we treat setTime(). Please don't add logSync on every addBlock().

          The difference is that, if we miss a setTime() call, the worst thing that happens is that on failover, one of the files has an out-of-date access time. This is not a dataloss.

          If we miss a persistBlocks(), on failover we will lose data. This is dataloss, and unacceptable.

          How do you propose to avoid this dataloss? Or are you suggesting that dataloss on failover is acceptable?

          Konstantin Shvachko added a comment -

          We currently log addBlocks but don't call logSync, assumedly to save on performance.

          I think this is the right implementation for this functionality. We will always have consistent state in memory and rely on batching to do the actual sync, which is common.

          I'll add logSync for now, and if people find a regression we can add a config to disable it.

          This should be similar to how we treat setTime(). Please don't add logSync on every addBlock().
          Just to make sure it's not misread, I am -1 on this.
          Regression of this particular change can be noticed only right now. I am also skeptical about introducing more config variables.

          Eli Collins added a comment -

          +1 looks good

          Todd Lipcon added a comment -

          The above QA results are wrong since it ran against trunk. Here is test-patch against the HA branch:

          [exec] -1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 2 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.8) warnings.
          [exec]
          [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
          [exec]
          [exec] -1 system test framework. The patch failed system test framework compile.
          [exec]

          Note that system tests are failing to build even in trunk right now, tracked by another JIRA.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12490750/hdfs-1108.txt
          against trunk revision 1159004.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The patch appears to cause tar ant target to fail.

          -1 findbugs. The patch appears to cause Findbugs (version 1.3.9) to fail.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:

          -1 contrib tests. The patch failed contrib unit tests.

          -1 system test framework. The patch failed system test framework compile.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/1120//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/1120//console

          This message is automatically generated.

          Todd Lipcon added a comment -
          • renamed config key
          • added persistBlocks to abandonBlock
          • added logSync where necessary
          Todd Lipcon added a comment -

          We have to log sync after every allocation right, or do you mean only when HA is enabled? We could consider batching syncs at the cost of response latency

          Syncs are already somewhat batched inside of FSEditLog, but it does make the RPC take longer, holds up the thread, etc. We currently log addBlocks but don't call logSync, assumedly to save on performance. But, it's at the expense of correctness, which I think is a mistake. I'll add logSync for now, and if people find a regression we can add a config to disable it.

          Piggy backing the initial allocation on the create makes a lot of sense

          agree, let's do that separately

          to why DNs are deleting blocks that it currently thinks are part of the file

          abandonBlock happens if the pipeline fails to be established at all, so the DNs never even get the blocks. So, it's only to prevent a followup addBlock from adding on top. But I think you're right that it has to log it as well... at the worst, there's no harm in it.
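
          For reference, here is a much-simplified sketch of the sync batching mentioned above (this is not FSEditLog's real implementation; the class name, field names, and locking are illustrative only):

          class BatchedSyncSketch {
            private long lastWrittenTxId = 0;
            private long lastSyncedTxId = 0;

            synchronized long logEdit() {
              return ++lastWrittenTxId;        // append to the in-memory edit buffer
            }

            void logSync(long myTxId) {
              long syncUpTo;
              synchronized (this) {
                if (myTxId <= lastSyncedTxId) {
                  return;                      // an earlier caller's sync already covered this txid
                }
                syncUpTo = lastWrittenTxId;    // this sync will cover everything buffered so far
              }
              // flush and fsync the buffered edits up to syncUpTo (elided in this sketch)
              synchronized (this) {
                lastSyncedTxId = Math.max(lastSyncedTxId, syncUpTo);
              }
            }
          }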

          Eli Collins added a comment -

          We have to log sync after every allocation right, or do you mean only when HA is enabled? We could consider batching syncs at the cost of response latency.

          Piggybacking the initial allocation on the create makes a lot of sense (it also saves a RT). This might actually simplify DataStreamer as it would always start with a block (either the last block for append or the initial block for create). I think we should do the piggyback in a separate change and use the approach in the current patch (internal config option). Seems like we should rename it to something more intuitive like dfs.log.block.allocs.

          I think we need to log block abandonment as well, otherwise if the standby becomes active and it doesn't see the block as abandoned then it will be confused as to why DNs are deleting blocks that it currently thinks are part of the file.

          Todd Lipcon added a comment -

          A few questions:

          • should we logSync() after every block? I took a look at the rpc metrics of a ~150 node cluster running HBase, and found that addBlock made up 3% of the operations, and 20% of the write operations. The number of create() operations and the number of addBlock() operations are very close to each other, indicating that at least on this cluster, most files consist of only one block. So, we could consider piggybacking the creation of the first block with the create() call, and then this wouldn't be an additional fsync to the logs (and would improve performance too)
          • abandonBlock() should maybe call persistBlocks() too?
          • should we document this new flag, or consider it an "internal" flag only used to override for testing? If we determine that the overhead is small, maybe we should just always have this behavior?
          Todd Lipcon added a comment -

          Rebased on the HA branch with the following changes:

          • some places used to persist blocks only if append was enabled; now they also persist blocks in those cases if the configuration is set
          • the config is forced to true if HA is enabled
          • improved the unit test so the writer keeps writing after the NN restarts, and verified that the data written both before and after the restart can be read back
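
          For illustration, a minimal sketch of the "forced to true if HA is enabled" logic, written against only the public Configuration API (the key name and helper are hypothetical; the real check lives inside FSNamesystem and uses HDFS-internal helpers):

              import org.apache.hadoop.conf.Configuration;

              public final class PersistBlocksConfig {
                // Hypothetical key name, used here only for illustration.
                public static final String DFS_PERSIST_BLOCKS_KEY = "dfs.persist.blocks";

                // Whether new block allocations should be journaled immediately.
                // Forced on under HA, because the standby can only learn about new
                // blocks through the shared edit log.
                public static boolean shouldPersistBlocks(Configuration conf, boolean haEnabled) {
                  return haEnabled || conf.getBoolean(DFS_PERSIST_BLOCKS_KEY, false);
                }
              }
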
          dhruba borthakur added a comment -

          Can you please merge the patch with the latest trunk and post it on Review Board? Thanks.

          dhruba borthakur added a comment -

          > how does namenode handle these blocks during restart, if the block was never written to the datanode?

          From my understanding, this problem exists in the current trunk too (and is not introduced by this patch) when an application uses hflush.

          Dmytro Molkov added a comment -

          Suresh:
          1. Yes, the block information will essentially be persisted twice: on each block allocation and on file close. Do you think that can be a problem for us? Since this is a configurable change and will only happen on specifically configured clusters, I do not feel it is bad.
          2. This part is tricky. I guess what can happen is: a new block is allocated, the client immediately dies without writing data, and the namenode crashes and needs a restart. When the namenode is restarted it will have this last block as UnderConstruction, and when the NN tries to release the lease on this file it will try to recover the block and never succeed, because the block is not present on the datanodes.
          However, it seems to be the same situation today: even without a namenode restart, the existence of this block in memory and its absence on the datanodes lead to the same problem.
          Another similar case is when the client calls hflush and then the namenode, the client, and all datanodes that are receiving the new block crash.

          Please correct me if I am wrong on this one.
          All in all, it seems that a namenode crash may lead to the client dying, so the probability of this happening might be higher than in my first example of what can happen today.

          3. I am not really sure what you mean by the primary flagging this to the standby, but in our case the only channel of communication between the primary and the standby is in fact the edit log, so this seemed like a reasonable way to go.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12447758/HDFS-1108.patch
          against trunk revision 957669.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/203/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/203/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/203/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/203/console

          This message is automatically generated.

          Suresh Srinivas added a comment -

          I have a few questions about this change:

          1. Which persistBlocks() are you referring to? Does this mean block information is persisted twice: once when the block is created and again, with the real block size, when the file is closed?
          2. Given that a block is recorded during creation, how does the namenode handle these blocks during a restart if the block was never written to a datanode?
          3. Should this be an HA-related change, where the primary namenode flags this to the standby instead of using edit logs written to disk?

          Dmytro Molkov added a comment -

          This is a rather simple patch.
          When we allocate a new block, we call persistBlocks on the file. Although we do not sync the transaction log to disk, this seems reasonable enough, since the only problematic case is the machine crashing.

          The test creates a file, waits for a few blocks to become available, then stops the mini cluster without closing the file and starts it again.
          The length of the file cannot be smaller than before the restart if the blocks were persisted into the log.
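
          A rough sketch of that kind of test, assuming the trunk-era MiniDFSCluster.Builder and hflush() APIs (on branch-1 the equivalents are the MiniDFSCluster constructor and sync()); the config key and file name are illustrative only:

              import org.apache.hadoop.conf.Configuration;
              import org.apache.hadoop.fs.FSDataOutputStream;
              import org.apache.hadoop.fs.FileSystem;
              import org.apache.hadoop.fs.Path;
              import org.apache.hadoop.hdfs.MiniDFSCluster;

              public class PersistBlocksRestartSketch {
                public static void main(String[] args) throws Exception {
                  Configuration conf = new Configuration();
                  conf.setBoolean("dfs.persist.blocks", true); // illustrative key name
                  conf.setLong("dfs.block.size", 1024 * 1024); // small blocks so the file spans several

                  MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).numDataNodes(3).build();
                  try {
                    FileSystem fs = cluster.getFileSystem();
                    Path file = new Path("/persist-blocks-test");
                    FSDataOutputStream out = fs.create(file);
                    byte[] data = new byte[64 * 1024];
                    for (int i = 0; i < 100; i++) {
                      out.write(data); // several full blocks, file left open (no close())
                    }
                    out.hflush();

                    long lenBefore = fs.getFileStatus(file).getLen();
                    cluster.restartNameNode(); // restart the NN with the file still open

                    long lenAfter = fs.getFileStatus(file).getLen();
                    if (lenAfter < lenBefore) {
                      throw new AssertionError("block allocations were not persisted across restart");
                    }
                  } finally {
                    cluster.shutdown();
                  }
                }
              }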

          dhruba borthakur added a comment -

          Todd: I agree that HBase will most likely need this feature for all files. In that case, a cluster-wide, server-side option to persist all new block allocations immediately makes sense.

          Todd Lipcon added a comment -

          OK, I see your point. The remaining question, then, is why do this only for specific files? Do you foresee a significant speed hit from simply logging all allocateBlocks calls to the edit log? Is the throughput of such calls significant enough that we need to bother treating "important" files differently?

          dhruba borthakur added a comment -

          This is related to namenode HA.

          Suppose an application has created a file with one block and started writing data to that block. The writer has not yet written a full block worth of data to the file. Now the NN fails over to the hot standby. The writer should be able to continue writing data to the file and should not see any interruption at all (assuming the failover completes in a few seconds). For this use case, we need the ability to persist block allocations as soon as the block is allocated to a file.

          The above could alternatively be achieved by making the DFSClient issue an fsync for every new block allocation. That is not efficient, because it translates to two RPCs for every new block allocation. Does this make sense?
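
          For comparison, the client-side alternative would amount to something like the helper below (a sketch only; the block-size bookkeeping is simplified, and per the description above each boundary costs an hflush/fsync round trip to the NN in addition to the addBlock RPC, which is exactly the overhead the server-side option avoids):

              import java.io.IOException;
              import org.apache.hadoop.fs.FSDataOutputStream;

              public final class PerBlockSyncWriter {
                // Write 'data' repeatedly until 'totalBytes' are written, flushing at
                // every block boundary so the NN persists each new block as it appears.
                public static void write(FSDataOutputStream out, byte[] data,
                                         long totalBytes, long blockSize) throws IOException {
                  long written = 0;
                  while (written < totalBytes) {
                    out.write(data);
                    written += data.length;
                    if (written % blockSize == 0) {
                      out.hflush(); // extra round trip per block, in addition to addBlock
                    }
                  }
                }
              }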

          Todd Lipcon added a comment -

          Why do you say that hflush doesn't take care of this?

          dhruba borthakur added a comment -

          This feature is needed to make HBase work seamlessly with NameNode HA (HDFS-976). Suppose a region server has created a new file and written some data to it, and then the NN fails over to the hot standby. The client should be able to continue writing data to the same file, which is now served by the new NameNode.


            People

            • Assignee: Todd Lipcon
            • Reporter: dhruba borthakur
            • Votes: 0
            • Watchers: 18
