[HDFS-14211] [Consistent Observer Reads] Allow for configurable "always msync" mode - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.3.0
Component/s: hdfs-client
Labels:
None

Description

To allow for reads to be serviced from an ObserverNode (see ~~HDFS-12943~~) in a consistent way, an msync API was introduced (~~HDFS-13688~~) to allow for a client to fetch the latest transaction ID from the Active NN, thereby ensuring that subsequent reads from the ObserverNode will be up-to-date with the current state of the Active.

Using this properly, however, requires application-side changes: for examples, a NodeManager should call msync before localizing the resources for a client, since it received notification of the existence of those resources via communicate which is out-of-band to HDFS and thus could potentially attempt to localize them prior to the availability of those resources on the ObserverNode.

Until such application-side changes can be made, which will be a longer-term effort, we need to provide a mechanism for unchanged clients to utilize the ObserverNode without exposing such a client to inconsistencies. This is essentially phase 3 of the roadmap outlined in the design document for ~~HDFS-12943~~.

The design document proposes some heuristics based on understanding of how common applications (e.g. MR) use HDFS for resources. As an initial pass, we can simply have a flag which tells a client to call msync before every single read operation. This may seem counterintuitive, as it turns every read operation into two RPCs: msync to the Active following by an actual read operation to the Observer. However, the msync operation is extremely lightweight, as it does not acquire the FSNamesystemLock, and in experiments we have found that this approach can easily scale to well over 100,000 msync operations per second on the Active (while still servicing approx. 10,000 write op/s). Combined with the fast-path edit log tailing for standby/observer nodes (~~HDFS-13150~~), this "always msync" approach should introduce only a few ms of extra latency to each read call.

Below are some experimental results collected from experiments which convert a normal RPC workload into one in which all read operations are turned into an msync. The baseline is a workload of 1.5k write op/s and 25k read op/s.

Rate Multiplier	2	4	6	8
RPC Queue Avg Time (ms)	14	53	110	125
RPC Queue NumOps Avg (k)	51	102	147	177
RPC Queue NumOps Max (k)	148	269	306	312

(numbers are approximate and should be viewed primarily for their trends)

Results are promising up to between 4x and 6x of the baseline workload, which is approx. 100-150k read op/s.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

HDFS-14211.000.patch
08/Mar/19 18:36
5 kB
Erik Krogen
HDFS-14211.001.patch
13/Mar/19 16:33
10 kB
Erik Krogen
HDFS-14211.002.patch
18/Mar/19 17:15
15 kB
Erik Krogen

Issue Links

is related to

HDFS-12943 Consistent Reads from Standby Node

Resolved

Activity

People

Assignee:: Erik Krogen

Reporter:: Erik Krogen

Votes:: 0 Vote for this issue

Watchers:: 11 Start watching this issue

Dates

Created:: 16/Jan/19 17:50

Updated:: 19/Mar/19 15:18

Resolved:: 19/Mar/19 15:16