[HDFS-17030] Limit wait time for getHAServiceState in ObserverReaderProxy - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 3.4.0
Fix Version/s: 3.4.0, 3.3.9
Component/s: hdfs
Labels: pull-request-available
Target Version/s: 3.4.0, 3.3.9
Hadoop Flags: Reviewed

Description

When NameNode HA is enabled and a standby NN is not responsive, we have observed that it takes a long time to serve a request, even though a healthy observer or active NN is available.

Basically, when a standby is down, the RPC client (re)tries to create a socket connection to that standby for ipc.client.connect.timeout * ipc.client.connect.max.retries.on.timeouts before giving up. When we take a heap dump of a standby, the NN still accepts the socket connection but does not respond to RPC requests, so each request times out only after ipc.client.rpc-timeout.ms. This adds significant latency. For clusters at LinkedIn, we set ipc.client.rpc-timeout.ms to 120 seconds, so a request takes more than 2 minutes to complete while a heap dump is being taken on a standby. This has been causing user job failures.
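To make the two delays concrete, here is a small arithmetic sketch. The class and method names are hypothetical, and the connect timeout and retry count are illustrative values chosen for the example, not settings reported in this issue; only the 120-second RPC timeout comes from the description above.

```java
public class ProbeDelaySketch {
    /** Rough upper bound on time spent connecting to an NN that is down. */
    static long connectGiveUpMs(long connectTimeoutMs, int maxRetriesOnTimeouts) {
        // Follows the product stated in the description:
        // ipc.client.connect.timeout * ipc.client.connect.max.retries.on.timeouts
        return connectTimeoutMs * maxRetriesOnTimeouts;
    }

    public static void main(String[] args) {
        // Illustrative values only (assumed for this example).
        long connectTimeoutMs = 20_000L;
        int maxRetriesOnTimeouts = 45;
        // From the description: ipc.client.rpc-timeout.ms = 120 seconds.
        long rpcTimeoutMs = 120_000L;

        System.out.println("standby down:     up to "
            + connectGiveUpMs(connectTimeoutMs, maxRetriesOnTimeouts) / 1000 + " s");
        System.out.println("standby hung:     up to " + rpcTimeoutMs / 1000 + " s");
    }
}
```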

We could set ipc.client.rpc-timeout.ms to a smaller value when sending getHAServiceState requests in ObserverReaderProxy (for user RPC requests, we would still use the original value from the config). However, that would double the number of socket connections between clients and the NN, which is a deal-breaker.

The proposal is to add a timeout on getHAServiceState() calls in ObserverReaderProxy, so that we wait only up to that timeout for an NN to respond with its HA state. Once the timeout has passed, we move on to probe the next NN.
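A minimal sketch of this idea follows, assuming the probe is run on a separate thread and the caller waits on it with a bounded Future.get. It is not the code from the linked pull requests: HaStateProbeSketch, NameNodeProbe, and simulated are hypothetical names introduced for illustration, and the NameNodes are simulated with sleeps.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class HaStateProbeSketch {
    public enum HaState { ACTIVE, STANDBY, OBSERVER }

    /** Hypothetical stand-in for the proxy to one NameNode. */
    public interface NameNodeProbe {
        String name();
        HaState getHAServiceState() throws Exception;
    }

    // Daemon threads, so a probe stuck on an unresponsive NN never blocks JVM exit.
    private final ExecutorService probeExecutor = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "ha-state-probe");
        t.setDaemon(true);
        return t;
    });
    private final long probeTimeoutMs;

    public HaStateProbeSketch(long probeTimeoutMs) {
        this.probeTimeoutMs = probeTimeoutMs;
    }

    /** Returns the NN's HA state, or null if the probe failed or did not answer in time. */
    public HaState probe(NameNodeProbe nn) {
        Callable<HaState> task = nn::getHAServiceState;
        Future<HaState> future = probeExecutor.submit(task);
        try {
            return future.get(probeTimeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // stop waiting and interrupt the stuck probe thread
            return null;
        } catch (ExecutionException e) {
            return null; // the probe itself failed, e.g. connection refused
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return null;
        }
    }

    /** Probes NNs in order; returns the name of the first one in the wanted state, or null. */
    public String findFirst(List<NameNodeProbe> nameNodes, HaState wanted) {
        for (NameNodeProbe nn : nameNodes) {
            if (probe(nn) == wanted) {
                return nn.name();
            }
        }
        return null;
    }

    /** Simulated NN that answers after delayMs; used only for the demo. */
    public static NameNodeProbe simulated(String name, HaState state, long delayMs) {
        return new NameNodeProbe() {
            @Override public String name() { return name; }
            @Override public HaState getHAServiceState() throws InterruptedException {
                Thread.sleep(delayMs);
                return state;
            }
        };
    }

    public static void main(String[] args) {
        HaStateProbeSketch prober = new HaStateProbeSketch(200);
        List<NameNodeProbe> nameNodes = Arrays.asList(
            simulated("nn1", HaState.STANDBY, 5_000),  // hung standby, e.g. mid heap dump
            simulated("nn2", HaState.OBSERVER, 10));   // healthy observer
        long start = System.nanoTime();
        String observer = prober.findFirst(nameNodes, HaState.OBSERVER);
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println("observer=" + observer + " after ~" + elapsedMs + " ms");
    }
}
```

In this sketch only the caller's wait is bounded; the RPC timeout used for user requests is left untouched, which is presumably what lets this approach avoid the extra connections described above.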

Issue Links

links to

GitHub Pull Request #5700 for trunk

GitHub Pull Request #5878 for branch-3.3

People

Assignee: Xing Lin

Reporter: Xing Lin

Votes: 0

Watchers: 2

Dates

Created: 29/May/23 21:29

Updated: 12/Dec/23 17:51

Resolved: 14/Jun/23 17:54