Hadoop HDFS / HDFS-12943 Consistent Reads from Standby Node / HDFS-13150

[Edit Tail Fast Path] Allow SbNN to tail in-progress edits from JN via RPC

Details

    Description

      In the interest of making coordinated/consistent reads easier to complete with low latency, it is advantageous to reduce the time between when a transaction is applied on the ANN and when it is applied on the SbNN. We propose adding a new "fast path" that can be used to tail edits when low latency is desired. We leave the existing tailing logic in place and fall back to it on startup, during recovery, and when the fast path encounters unrecoverable errors.
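      For illustration only, a rough sketch of the intended control flow under these assumptions (all class and method names below are hypothetical, not the actual EditLogTailer API):

      import java.io.IOException;
      import java.util.List;

      // Illustrative-only sketch of the tailing flow described above; none of
      // these names correspond to the actual EditLogTailer implementation.
      abstract class FastPathTailerSketch {
        private volatile boolean running = true;
        private long lastAppliedTxId;
        private boolean useFastPath = true;

        void tailLoop() {
          while (running) {
            try {
              if (useFastPath) {
                // Low-latency path: pull recent in-progress edits over RPC.
                applyEdits(fetchEditsViaRpc(lastAppliedTxId + 1));
              } else {
                // Existing path: stream edit log segments as before.
                applyEdits(fetchEditsViaStreaming(lastAppliedTxId + 1));
                useFastPath = true; // retry the fast path once caught up
              }
            } catch (IOException e) {
              // Unrecoverable fast-path error: fall back to the existing tailing logic.
              useFastPath = false;
            }
          }
        }

        abstract List<byte[]> fetchEditsViaRpc(long fromTxId) throws IOException;
        abstract List<byte[]> fetchEditsViaStreaming(long fromTxId) throws IOException;
        abstract void applyEdits(List<byte[]> edits);
      }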

      Attachments

        1. edit-tailing-fast-path-design-v2.pdf
          172 kB
          Erik Krogen
        2. edit-tailing-fast-path-design-v1.pdf
          171 kB
          Erik Krogen
        3. edit-tailing-fast-path-design-v0.pdf
          165 kB
          Erik Krogen


          Activity

            xkrogen Erik Krogen added a comment -

            Design document with more details coming soon...

            xkrogen Erik Krogen added a comment -

            Attached a design document detailing the proposal, comments welcomed!

            cc csun

            csun Chao Sun added a comment -

            Thanks for the design doc xkrogen! The doc looks great overall. I have a few comments after reading it:

            • What is the relation of this to in-progress edit log tailing? Also, is this a separate feature that can be turned on/off, or do we target it as a replacement for the current approach?
            • How much savings did you see from preloading edits outside the lock? I did something similar in my experiment but it didn't show an obvious benefit.
            • With the cache, does it mean we need an extra decode + encode on the journal side? Is there any perf impact on the journal?
            • In the performance evaluation section, the max (ms) in scenario 5 is extremely high compared to the others; is that a mistake?
            • To achieve low latency, the Observer NameNode also needs to pull from the JournalNodes at a very high frequency, right? Did you cover that in the benchmark?
            • It might be better to explain a little, at a high level, how this can be used for the Observer NameNode.
            xkrogen Erik Krogen added a comment -

            Thanks for the read, Chao! Responses inline.

            What is the relation of this to in-progress edit log tailing? Also, is this a separate feature that can be turned on/off, or do we target it as a replacement for the current approach?

            The addition of this feature does not necessarily remove the possibility of using the traditional edit tail path to read in-progress edits, so we could have it enabled via a separate flag. This would create a situation where, for the feature to work as intended, you have to change 3 configurations: enable in-progress edit log tailing, enable the fast-path, and turn the edit tail period (dfs.ha.tail-edits.period) down to 0. We should probably leave the edit tail period config as-is, but given that in-progress edit log tailing is a new/experimental feature anyhow, perhaps we can just enable the fast path as part of enabling in-progress edit log tailing, reducing it to two configurations.
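            For concreteness, a hedged example of what the three-setting variant could look like using the standard Configuration API. Only dfs.ha.tail-edits.period is named in this thread; the other two keys below are assumed/placeholder names, not necessarily the final configuration names:

            import org.apache.hadoop.conf.Configuration;

            // Hedged illustration of the three-setting variant discussed above.
            public class FastPathConfigExample {
              public static void main(String[] args) {
                Configuration conf = new Configuration();
                conf.setBoolean("dfs.ha.tail-edits.in-progress", true);       // enable in-progress edit tailing (assumed key name)
                conf.setBoolean("dfs.ha.tail-edits.fast-path.enabled", true); // hypothetical fast-path flag
                conf.set("dfs.ha.tail-edits.period", "0");                    // tail as frequently as possible
                System.out.println("tail period = " + conf.get("dfs.ha.tail-edits.period"));
              }
            }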

            How much savings did you see from preloading edits outside the lock? I did something similar in my experiment but it didn't show an obvious benefit.

            I also didn't see a significant performance improvement, but I think in general it is a good idea to avoid I/O inside of a lock if possible to protect against temporary hiccups.
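            The general pattern being described, i.e. doing the (possibly slow) I/O before taking the lock and holding the lock only to apply, might look roughly like the following sketch (not the actual namesystem/FSEditLogLoader code):

            import java.io.IOException;
            import java.util.List;
            import java.util.concurrent.locks.ReentrantReadWriteLock;

            // Sketch of "preload edits outside the lock": the (possibly slow) read
            // happens before acquiring the lock, so an I/O hiccup cannot stall other
            // lock holders.
            abstract class PreloadOutsideLockSketch {
              private final ReentrantReadWriteLock nsLock = new ReentrantReadWriteLock();

              void tailOnce(long fromTxId) throws IOException {
                // 1. I/O outside the lock: fetch and decode the next batch of edits.
                List<byte[]> edits = loadEdits(fromTxId);

                // 2. Only the in-memory application happens under the write lock.
                nsLock.writeLock().lock();
                try {
                  applyToNamespace(edits);
                } finally {
                  nsLock.writeLock().unlock();
                }
              }

              abstract List<byte[]> loadEdits(long fromTxId) throws IOException; // may block on network/disk
              abstract void applyToNamespace(List<byte[]> edits);                // pure in-memory work
            }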

            With the cache, does it mean we need an extra decode + encode on the journal side? Is there any perf impact on the journal?

            Yes, unfortunately this adds an extra decode/encode cycle on the JN side. We did not see any significant performance impact on the JNs in our experiments. If this ends up being an issue, we can reduce this to a single decode (no re-encode) by storing the cache as the original serialized form rather than the deserialized form. This is a bit more complex so I would prefer not to unless it turns out to be an issue in practice.
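            To make that alternative concrete, here is a hypothetical sketch of a cache that keeps edits in their original serialized form keyed by starting txid, so serving a tailing request is a lookup plus a byte copy with no re-encode (this is not the actual JournalNode cache):

            import java.util.Map;
            import java.util.TreeMap;

            // Hypothetical sketch of the "store serialized form" alternative: the JN
            // keeps raw edit-op bytes keyed by their first txid, so serving a tailing
            // RPC is a range lookup and byte copy, with no decode/encode of the ops.
            class SerializedEditCacheSketch {
              private final TreeMap<Long, byte[]> cache = new TreeMap<>(); // firstTxId -> raw bytes
              private long totalBytes;
              private final long capacityBytes;

              SerializedEditCacheSketch(long capacityBytes) {
                this.capacityBytes = capacityBytes;
              }

              synchronized void put(long firstTxId, byte[] rawOps) {
                cache.put(firstTxId, rawOps);
                totalBytes += rawOps.length;
                // Evict the oldest batches once we exceed the configured capacity.
                while (totalBytes > capacityBytes && !cache.isEmpty()) {
                  totalBytes -= cache.pollFirstEntry().getValue().length;
                }
              }

              /**
               * Returns the cached batch whose first txid is at or before sinceTxId
               * (the caller skips ahead within it), or null on a cache miss.
               */
              synchronized byte[] get(long sinceTxId) {
                Map.Entry<Long, byte[]> e = cache.floorEntry(sinceTxId);
                return e == null ? null : e.getValue();
              }
            }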

            In the performance evaluation section, the max (ms) in scenario 5 is extremely high compared to the others; is that a mistake?

            No, this is what was measured. I haven't seen the same spike in other runs of the experiment, but I didn't want to feel I was faking results by providing a cleaner test run. Given the very PoC nature of the code this benchmark ran on, I hope to eliminate any such issues when we create a production-ready version. In particular, it may be helpful to log when latencies are higher than expected, similar to the idea of HDFS-9145, so that we can track down any issues.
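            Something along these lines, as a hypothetical sketch (the threshold and logger below are illustrative, not tied to any existing code):

            import org.slf4j.Logger;
            import org.slf4j.LoggerFactory;

            // Hypothetical sketch of the HDFS-9145-style idea mentioned above: log
            // whenever a single tailing iteration takes longer than expected.
            class TailLatencyLogSketch {
              private static final Logger LOG = LoggerFactory.getLogger(TailLatencyLogSketch.class);
              private static final long WARN_THRESHOLD_MS = 100; // illustrative threshold

              void timedTail(Runnable tailOnce) {
                long start = System.nanoTime();
                tailOnce.run();
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                if (elapsedMs > WARN_THRESHOLD_MS) {
                  LOG.warn("Edit tailing iteration took {} ms (expected < {} ms)",
                      elapsedMs, WARN_THRESHOLD_MS);
                }
              }
            }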

            To achieve low latency, the Observer NameNode also needs to pull from the JournalNodes at a very high frequency, right? Did you cover that in the benchmark?

            This is correct; in the benchmark the sleep period for edit tailing was turned down to 0.

            It might be better to explain a little, at a high level, how this can be used for the Observer NameNode.

            Sure, added a little bit of discussion in Applicability to ObserverNode in v1.

            csun Chao Sun added a comment -

            Thanks Erik. Overall I'm good with the design. I also like approach 1 (SbNN performs quorum reads) better, and I think overall it should be correct. Looking forward to this feature!


            shv Konstantin Shvachko added a comment -

            As a reminder, the two approaches for the SBN / ObserverNode reading from JournalNodes are:

            1. SBN reads from a quorum of JNs
            2. SBN reads from single JN, while JNs guarantee serving only committed transactions

            I am advocating that approach 2 is faster. Suppose we have 3 journal nodes.

            • When the SBN reads from a quorum (approach 1), it updates its state only as fast as the second-slowest JN.
            • With approach 2 we can choose the fastest JN most of the time, by periodically polling the JNs and switching to the one with the higher txId.

            There is an issue of confirming committed transactions from the ANN to the JNs, but every next batch of edits sent by the ANN to a JN essentially confirms that the previous batch is committed, so this does not require extra dummy syncs. Under regular load the ANN will be sending batches of edits continuously, so the JNs will be up to date through the last processed batch. The ANN will occasionally need to send an extra "dummy" sync, but only when it has no load or no writes at all.
            Having said that, I am fine with quorum reads as the initial implementation if it is simpler, as I was told.
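            For reference, a simplified sketch of what the quorum read in approach 1 amounts to: collect the last txid each responding JN can serve and apply up to the value reached by a majority, which for 3 JNs is the median (illustrative only, not the actual QuorumJournalManager logic):

            import java.util.ArrayList;
            import java.util.Arrays;
            import java.util.Collections;
            import java.util.List;

            // Illustrative-only sketch of approach 1: the txid that is safe to apply
            // is the one reached by a majority of JNs, i.e. the median for 3 JNs --
            // which is why the SBN advances only as fast as the second-slowest JN.
            class QuorumReadSketch {
              /** lastTxIds holds the highest txid each responding JN can serve. */
              static long committedUpTo(List<Long> lastTxIds, int totalJournals) {
                int quorum = totalJournals / 2 + 1;
                if (lastTxIds.size() < quorum) {
                  throw new IllegalStateException("No quorum of JN responses");
                }
                List<Long> sorted = new ArrayList<>(lastTxIds);
                sorted.sort(Collections.reverseOrder());
                // At least `quorum` JNs have written up to the quorum-th highest txid.
                return sorted.get(quorum - 1);
              }

              public static void main(String[] args) {
                // Example with 3 JNs reporting 120, 118 and 95: safe to apply up to 118.
                System.out.println(committedUpTo(Arrays.asList(120L, 118L, 95L), 3));
              }
            }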

            xkrogen Erik Krogen added a comment - edited

            Attaching v2 document with some changes to reflect decisions made during implementation. The only larger change is to the way the in-memory cache of edits is set up (see section JournalNode Edit Cache).

            I have created four sub-tasks under this JIRA to split it into smaller, easier-to-review pieces: HDFS-13607, HDFS-13608, HDFS-13609, HDFS-13610. Uploading patches there now.

            xkrogen Erik Krogen added a comment -

            Closing this as all sub-issues (HDFS-13607, HDFS-13608, HDFS-13609, HDFS-13610) have been completed. Thanks to all who helped with this new feature!

            liutongwei liutongwei added a comment -

            xkrogen As I'm studying the design doc for fast-path tailing, I have a doubt about the correctness of approach 1 for ensuring that only committed transactions are applied.

            In this design, the minimum lastWrittenTxId is used as a safe point up to which the log can be applied. But in a corner case where an out-of-sync JN comes back online, that JN may indeed report the minimum lastWrittenTxId, yet the data at that lastWrittenTxId may differ from what the other JNs hold. This can happen when the transaction at that lastWrittenTxId on the out-of-sync JN was never committed by the prior writer and is later overwritten by a writer with a new epoch. In that case, we would apply uncommitted data.

            How about using a quorum read combined with approach 2 to get the max committed txid? That is definitely correct, because the committed txid is updated by the writer and is guaranteed by the writer even if a recovery has occurred.

            Correct me if anything is wrong.

            Looking forward to your reply.

            xkrogen Erik Krogen added a comment -

            liutongwei thanks for sharing your concern!

            I don't quite remember how epochs interplay with the durability or reuse of transaction IDs – it's been quite a while since I've looked at this area of the code. Unfortunately I'm also not actively working on HDFS currently. I took a brief look around the JN code in this area to refresh my memory, but I'm still missing some details and don't have the time to invest in properly understanding your concern.

            shv, do you have any insight on the concern above?


            shv Konstantin Shvachko added a comment -

            We ended up implementing quorum reads from the JNs for the Observer fast path.
            You should check the code, liutongwei.


            People

              Assignee: Erik Krogen
              Reporter: Erik Krogen