The current edit tailing pipeline was designed for:
- High resiliency
- High throughput
It was not designed for low latency.
It was built under the assumption that each edit log segment would typically be read in full, e.g. on startup, or by the SbNN tailing an entire segment after it is finalized. The ObserverNode, by contrast, must read constantly from the JournalNodes' in-progress edit logs with low latency, to reduce the lag between when a transaction is committed on the ANN and when it becomes visible on the ObserverNode.
Given how critical this pipeline is to the health of HDFS, it is better not to redesign it altogether. Experiments suggest that mitigating the following two issues reduces lag times to low levels (low hundreds of milliseconds, even under very high write load):
- The overhead of creating a new HTTP connection each time new edits are fetched. This cost is acceptable when tailing an entire segment; it is not when fetching only a small number of edits. We can mitigate it by allowing edits to be tailed via an RPC call, or by pooling the existing HTTP connections to the JournalNodes.
- The overhead of transmitting a whole file at once. Currently, when an edit segment is requested, the JN sends the entire segment, and the SbNN discards edits up to the ones it actually wants. This one is trickier to solve, but one option is to keep recently logged edits in memory, so the JN can quickly serve only the required edits without reading the file at all.
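The second mitigation could be sketched as a small bounded cache on the JN, keyed by the starting transaction ID of each logged batch. The class and method names below are illustrative, not the actual HDFS implementation; the key behaviors are that lookups for already-evicted transactions miss (forcing the caller onto the file-based path), and that eviction is oldest-first under a byte budget.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/**
 * Hypothetical sketch of a JN-side in-memory cache of recently logged
 * edits, keyed by the first txid of each serialized batch.
 */
class RecentEditsCache {
    private final NavigableMap<Long, byte[]> editsByTxid = new TreeMap<>();
    private final long capacityBytes;
    private long totalBytes = 0;

    RecentEditsCache(long capacityBytes) {
        this.capacityBytes = capacityBytes;
    }

    /** Store a batch of serialized edits whose first txid is firstTxid. */
    synchronized void put(long firstTxid, byte[] serializedEdits) {
        editsByTxid.put(firstTxid, serializedEdits);
        totalBytes += serializedEdits.length;
        // Evict oldest batches once over budget; a reader that has fallen
        // this far behind must use the existing file-based path instead.
        while (totalBytes > capacityBytes && editsByTxid.size() > 1) {
            totalBytes -= editsByTxid.pollFirstEntry().getValue().length;
        }
    }

    /**
     * Return all cached batches covering fromTxid onward, or null if
     * fromTxid predates the cache (caller falls back to reading the file).
     */
    synchronized List<byte[]> get(long fromTxid) {
        Long start = editsByTxid.floorKey(fromTxid);
        if (start == null) {
            return null; // too old: already evicted or never cached
        }
        return new ArrayList<>(editsByTxid.tailMap(start, true).values());
    }
}
```

A real version would also need to track the last cached txid so a reader can tell "cache miss" apart from "fully caught up", but the miss-means-fallback contract is the important part.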
We can implement these as optimizations layered on top of the existing logic, falling back to the current slow-but-resilient pipeline whenever the fast path cannot serve a request.