Hadoop HDFS / HDFS-12943 Consistent Reads from Standby Node / HDFS-13150

[Edit Tail Fast Path] Allow SbNN to tail in-progress edits from JN via RPC

Details

    Description

      In the interest of making coordinated/consistent reads easier to complete with low latency, it is advantageous to reduce the time between when a transaction is applied on the ANN and when it is applied on the SbNN. We propose adding a new "fast path" that can be used to tail edits when low latency is desired. We leave the existing tailing logic in place and fall back to it on startup, during recovery, and when the fast path encounters unrecoverable errors.
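      For illustration only, a rough sketch of the intended control flow under these assumptions (all class and method names below are hypothetical, not the actual EditLogTailer API):

      import java.io.IOException;
      import java.util.List;

      // Illustrative-only sketch of the tailing flow described above; none of
      // these names correspond to the actual EditLogTailer implementation.
      abstract class FastPathTailerSketch {
        private volatile boolean running = true;
        private long lastAppliedTxId;
        private boolean useFastPath = true;

        void tailLoop() {
          while (running) {
            try {
              if (useFastPath) {
                // Low-latency path: pull recent in-progress edits over RPC.
                applyEdits(fetchEditsViaRpc(lastAppliedTxId + 1));
              } else {
                // Existing path: stream edit log segments as before.
                applyEdits(fetchEditsViaStreaming(lastAppliedTxId + 1));
                useFastPath = true; // retry the fast path once caught up
              }
            } catch (IOException e) {
              // Unrecoverable fast-path error: fall back to the existing tailing logic.
              useFastPath = false;
            }
          }
        }

        abstract List<byte[]> fetchEditsViaRpc(long fromTxId) throws IOException;
        abstract List<byte[]> fetchEditsViaStreaming(long fromTxId) throws IOException;
        abstract void applyEdits(List<byte[]> edits);
      }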

      Attachments

        1. edit-tailing-fast-path-design-v2.pdf
          172 kB
          Erik Krogen
        2. edit-tailing-fast-path-design-v1.pdf
          171 kB
          Erik Krogen
        3. edit-tailing-fast-path-design-v0.pdf
          165 kB
          Erik Krogen


          Activity

            xkrogen Erik Krogen added a comment -

            Design document with more details coming soon...

            xkrogen Erik Krogen added a comment -

            Attached a design document detailing the proposal, comments welcomed!

            cc csun

            csun Chao Sun added a comment -

            Thanks for the design doc xkrogen! The doc looks great overall. I have a few comments after reading it:

            • What is the relation of this to in-progress edit log tailing? Also, is this a separate feature that can be turned on/off, or do we target it as a replacement for the current approach?
            • How much savings did you see from preloading edits outside the lock? I did something similar in my experiment but it didn't show an obvious benefit.
            • With the cache, does it mean we need an extra decode + encode on the journal side? Is there any perf impact on the journal?
            • In the performance evaluation section, the max (ms) in scenario 5 is extremely high compared to the others; is that a mistake?
            • To achieve low latency, the Observer NameNode also needs to pull from the JournalNodes at a very high frequency, right? Did you cover that in the benchmark?
            • It might be better to explain a little, at a high level, how this can be used for the Observer NameNode.
            xkrogen Erik Krogen added a comment -

            Thanks for the read, Chao! Responses inline.

            What is the relation of this to in-progress edit log tailing? Also, is this a separate feature that can be turned on/off, or do we target it as a replacement for the current approach?

            The addition of this feature does not necessarily remove the possibility of using the traditional edit tail path to read in-progress edits, so we could have it enabled via a separate flag. This would create a situation where, for the feature to work as intended, you have to change 3 configurations: enable in-progress edit log tailing, enable the fast-path, and turn the edit tail period (dfs.ha.tail-edits.period) down to 0. We should probably leave the edit tail period config as-is, but given that in-progress edit log tailing is a new/experimental feature anyhow, perhaps we can just enable the fast path as part of enabling in-progress edit log tailing, reducing it to two configurations.
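            For concreteness, a hedged example of what the three-setting variant could look like using the standard Configuration API. Only dfs.ha.tail-edits.period is named in this thread; the other two keys below are assumed/placeholder names, not necessarily the final configuration names:

            import org.apache.hadoop.conf.Configuration;

            // Hedged illustration of the three-setting variant discussed above.
            public class FastPathConfigExample {
              public static void main(String[] args) {
                Configuration conf = new Configuration();
                conf.setBoolean("dfs.ha.tail-edits.in-progress", true);       // enable in-progress edit tailing (assumed key name)
                conf.setBoolean("dfs.ha.tail-edits.fast-path.enabled", true); // hypothetical fast-path flag
                conf.set("dfs.ha.tail-edits.period", "0");                    // tail as frequently as possible
                System.out.println("tail period = " + conf.get("dfs.ha.tail-edits.period"));
              }
            }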

            How much savings did you see from preloading edits outside the lock? I did something similar in my experiment but it didn't show an obvious benefit.

            I also didn't see a significant performance improvement, but I think in general it is a good idea to avoid I/O inside of a lock if possible to protect against temporary hiccups.
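            The general pattern being described, i.e. doing the (possibly slow) I/O before taking the lock and holding the lock only to apply, might look roughly like the following sketch (not the actual namesystem/FSEditLogLoader code):

            import java.io.IOException;
            import java.util.List;
            import java.util.concurrent.locks.ReentrantReadWriteLock;

            // Sketch of "preload edits outside the lock": the (possibly slow) read
            // happens before acquiring the lock, so an I/O hiccup cannot stall other
            // lock holders.
            abstract class PreloadOutsideLockSketch {
              private final ReentrantReadWriteLock nsLock = new ReentrantReadWriteLock();

              void tailOnce(long fromTxId) throws IOException {
                // 1. I/O outside the lock: fetch and decode the next batch of edits.
                List<byte[]> edits = loadEdits(fromTxId);

                // 2. Only the in-memory application happens under the write lock.
                nsLock.writeLock().lock();
                try {
                  applyToNamespace(edits);
                } finally {
                  nsLock.writeLock().unlock();
                }
              }

              abstract List<byte[]> loadEdits(long fromTxId) throws IOException; // may block on network/disk
              abstract void applyToNamespace(List<byte[]> edits);                // pure in-memory work
            }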

            With the cache, does it mean we need an extra decode + encode on the journal side? Is there any perf impact on the journal?

            Yes, unfortunately this adds an extra decode/encode cycle on the JN side. We did not see any significant performance impact on the JNs in our experiments. If this ends up being an issue, we can reduce this to a single decode (no re-encode) by storing the cache as the original serialized form rather than the deserialized form. This is a bit more complex so I would prefer not to unless it turns out to be an issue in practice.
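            To make that alternative concrete, here is a hypothetical sketch of a cache that keeps edits in their original serialized form keyed by starting txid, so serving a tailing request is a lookup plus a byte copy with no re-encode (this is not the actual JournalNode cache):

            import java.util.Map;
            import java.util.TreeMap;

            // Hypothetical sketch of the "store serialized form" alternative: the JN
            // keeps raw edit-op bytes keyed by their first txid, so serving a tailing
            // RPC is a range lookup and byte copy, with no decode/encode of the ops.
            class SerializedEditCacheSketch {
              private final TreeMap<Long, byte[]> cache = new TreeMap<>(); // firstTxId -> raw bytes
              private long totalBytes;
              private final long capacityBytes;

              SerializedEditCacheSketch(long capacityBytes) {
                this.capacityBytes = capacityBytes;
              }

              synchronized void put(long firstTxId, byte[] rawOps) {
                cache.put(firstTxId, rawOps);
                totalBytes += rawOps.length;
                // Evict the oldest batches once we exceed the configured capacity.
                while (totalBytes > capacityBytes && !cache.isEmpty()) {
                  totalBytes -= cache.pollFirstEntry().getValue().length;
                }
              }

              /**
               * Returns the cached batch whose first txid is at or before sinceTxId
               * (the caller skips ahead within it), or null on a cache miss.
               */
              synchronized byte[] get(long sinceTxId) {
                Map.Entry<Long, byte[]> e = cache.floorEntry(sinceTxId);
                return e == null ? null : e.getValue();
              }
            }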

            In the performance evaluation section, the max (ms) in scenario 5 is extremely high compared to the others; is that a mistake?

            No, this is what was measured. I haven't seen the same spike in other runs of the experiment, but I didn't want to feel I was faking results by providing a cleaner test run. Given the very PoC nature of the code this benchmark ran on, I hope to eliminate any such issues when we create a production-ready version. In particular, it may be helpful to log when latencies are higher than expected, similar to the idea of HDFS-9145, so that we can track down any issues.
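            Something along these lines, as a hypothetical sketch (the threshold and logger below are illustrative, not tied to any existing code):

            import org.slf4j.Logger;
            import org.slf4j.LoggerFactory;

            // Hypothetical sketch of the HDFS-9145-style idea mentioned above: log
            // whenever a single tailing iteration takes longer than expected.
            class TailLatencyLogSketch {
              private static final Logger LOG = LoggerFactory.getLogger(TailLatencyLogSketch.class);
              private static final long WARN_THRESHOLD_MS = 100; // illustrative threshold

              void timedTail(Runnable tailOnce) {
                long start = System.nanoTime();
                tailOnce.run();
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                if (elapsedMs > WARN_THRESHOLD_MS) {
                  LOG.warn("Edit tailing iteration took {} ms (expected < {} ms)",
                      elapsedMs, WARN_THRESHOLD_MS);
                }
              }
            }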

            To achieve low latency, the Observer NameNode also needs to pull from the JournalNodes at a very high frequency, right? Did you cover that in the benchmark?

            This is correct; in the benchmark the sleep period for edit tailing was turned down to 0.

            It might be better to explain a little, at a high level, how this can be used for the Observer NameNode.

            Sure, added a little bit of discussion in Applicability to ObserverNode in v1.

            csun Chao Sun added a comment -

            Thanks Erik. Overall I'm good with the design. I also like approach 1 (SbNN performs quorum reads) better, and I think overall it should be correct. Looking forward to this feature!


            shv Konstantin Shvachko added a comment -

            As a reminder, the two approaches for the SBN / ObserverNode reading from JournalNodes are:

            1. SBN reads from a quorum of JNs
            2. SBN reads from single JN, while JNs guarantee serving only committed transactions

            I am advocating that approach 2 is faster. Suppose we have 3 journal nodes.

            • When the SBN reads from a quorum (approach 1), it updates its state only as fast as the second-slowest JN.
            • With approach 2 we can choose the fastest JN most of the time, by periodically polling the JNs and switching to the one with the higher txId.

            There is an issue of confirming committed transactions from the ANN to the JNs, but every next batch of edits sent by the ANN to a JN essentially confirms that the previous batch is committed, so this does not require extra dummy syncs. Under regular load the ANN will be sending batches of edits continuously, so the JNs will be up to date through the last processed batch. The ANN will occasionally need to send an extra "dummy" sync, but only when it has no load or no writes at all.
            Having said that, I am fine with quorum reads as the initial implementation if it is simpler, as I was told.
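            For reference, a simplified sketch of what the quorum read in approach 1 amounts to: collect the last txid each responding JN can serve and apply up to the value reached by a majority, which for 3 JNs is the median (illustrative only, not the actual QuorumJournalManager logic):

            import java.util.ArrayList;
            import java.util.Arrays;
            import java.util.Collections;
            import java.util.List;

            // Illustrative-only sketch of approach 1: the txid that is safe to apply
            // is the one reached by a majority of JNs, i.e. the median for 3 JNs --
            // which is why the SBN advances only as fast as the second-slowest JN.
            class QuorumReadSketch {
              /** lastTxIds holds the highest txid each responding JN can serve. */
              static long committedUpTo(List<Long> lastTxIds, int totalJournals) {
                int quorum = totalJournals / 2 + 1;
                if (lastTxIds.size() < quorum) {
                  throw new IllegalStateException("No quorum of JN responses");
                }
                List<Long> sorted = new ArrayList<>(lastTxIds);
                sorted.sort(Collections.reverseOrder());
                // At least `quorum` JNs have written up to the quorum-th highest txid.
                return sorted.get(quorum - 1);
              }

              public static void main(String[] args) {
                // Example with 3 JNs reporting 120, 118 and 95: safe to apply up to 118.
                System.out.println(committedUpTo(Arrays.asList(120L, 118L, 95L), 3));
              }
            }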

            xkrogen Erik Krogen added a comment - edited

            Attaching v2 document with some changes to reflect decisions made during implementation. The only larger change is to the way the in-memory cache of edits is set up (see section JournalNode Edit Cache).

            I have created four sub-tasks under this JIRA to split it into smaller, easier-to-review pieces: HDFS-13607, HDFS-13608, HDFS-13609, HDFS-13610. Uploading patches there now.

            xkrogen Erik Krogen added a comment -

            Closing this as all sub-issues (HDFS-13607, HDFS-13608, HDFS-13609, HDFS-13610) have been completed. Thanks to all who helped with this new feature!

            liutongwei liutongwei added a comment -

            xkrogen As I'm studying the design doc for fast-path tailing, I have a doubt about the correctness of approach 1 for ensuring that only committed transactions are applied.

            In this design, the minimum lastWrittenTxId is used as a safe point up to which the log can be applied. But in a corner case where an out-of-sync JN comes back online, that JN may indeed report the minimum lastWrittenTxId, yet the data at that lastWrittenTxId may differ from what the other JNs hold. This can happen when the transaction at that lastWrittenTxId on the out-of-sync JN was never committed by the prior writer and is later overwritten by a writer with a new epoch. In that case, we would apply uncommitted data.

            How about using a quorum read combined with approach 2 to get the max committed txid? That is definitely correct, because the committed txid is updated by the writer and is guaranteed by the writer even if a recovery has occurred.

            Correct me if anything is wrong.

            Looking forward to your reply.

            xkrogen Erik Krogen added a comment -

            liutongwei thanks for sharing your concern!

            I don't quite remember how epochs interplay with the durability or reuse of transaction IDs – it's been quite a while since I've looked at this area of the code. Unfortunately I'm also not actively working on HDFS currently. I took a brief look around the JN code in this area to refresh my memory, but I'm still missing some details and don't have the time to invest in properly understanding your concern.

            shv, do you have any insight on the concern above?


            shv Konstantin Shvachko added a comment -

            We ended up implementing quorum reads from the JNs for the Observer fast path.
            You should check the code, liutongwei.


            People

              Assignee: Erik Krogen
              Reporter: Erik Krogen