[HDFS-14806] Bootstrap standby may fail if used in-progress tailing - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.10.0
Fix Version/s: 3.3.0, 3.1.4, 3.2.2, 2.10.1
Component/s: namenode
Labels: None

Description

One issue we came across is that bootstrap standby can fail when in-progress tailing is enabled.

When in-progress tailing is enabled, bootstrap uses the RPC mechanism to fetch edits. The config dfs.ha.tail-edits.qjm.rpc.max-txns sets an upper bound on how many transactions can be included in one RPC call; the default is 5000. This means the bootstrapping NN (say NN1) can pull at most 5000 edits from the JNs. However, as part of bootstrap, NN1 queries another NN (say NN2) for NN2's current transaction ID, and NN2 may return a state that is more than 5000 txids ahead of NN1's current image. NN1 can still see only 5000 more txids from the JNs. At this point NN1 panics, because the txid returned by the JNs is behind the state returned by NN2, and bootstrap fails.

Essentially, bootstrap standby can fail if both of the following conditions are met:

in-progress tailing is enabled, AND
the bootstrapping NN is too far (>5000 txids) behind

Increasing dfs.ha.tail-edits.qjm.rpc.max-txns to a very large value allowed bootstrap to continue, but this is hardly an ideal solution.
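
The failure condition and the workaround described above can be illustrated with a small standalone model. This is only a sketch of the arithmetic, not the NameNode code: the class name, method names, and the sample transaction IDs are all hypothetical, and it assumes the bootstrapping NN gets a single capped RPC fetch of edits starting from its image.

```java
// Hypothetical model of the txid arithmetic in this issue; not Hadoop code.
class BootstrapTailingSketch {
    // Default of dfs.ha.tail-edits.qjm.rpc.max-txns, per the description.
    static final long DEFAULT_RPC_MAX_TXNS = 5000;

    // Highest txid NN1 can reach: its image txid plus at most rpcMaxTxns
    // edits, and never beyond what the JNs actually hold.
    static long highestReachableTxId(long imageTxId, long jnLatestTxId,
                                     long rpcMaxTxns) {
        return Math.min(jnLatestTxId, imageTxId + rpcMaxTxns);
    }

    // Bootstrap can proceed only if the edits NN1 can see reach the
    // transaction ID that NN2 reported.
    static boolean bootstrapCanProceed(long imageTxId, long otherNnTxId,
                                       long jnLatestTxId, long rpcMaxTxns) {
        return highestReachableTxId(imageTxId, jnLatestTxId, rpcMaxTxns)
                >= otherNnTxId;
    }

    public static void main(String[] args) {
        long imageTxId = 100_000L; // NN1's current image (made-up value)
        long nn2TxId = 112_000L;   // NN2's reported state, 12000 txids ahead
        long jnLatest = 112_000L;  // the JNs hold everything NN2 has

        // Default cap: NN1 only reaches 105000, short of 112000.
        System.out.println(bootstrapCanProceed(
                imageTxId, nn2TxId, jnLatest, DEFAULT_RPC_MAX_TXNS));
        // Workaround: a much larger cap lets NN1 reach NN2's state.
        System.out.println(bootstrapCanProceed(
                imageTxId, nn2TxId, jnLatest, 1_000_000L));
    }
}
```

With these sample values the first call reports that bootstrap cannot proceed and the second reports that it can. This matches the observation that raising the cap hides the problem without removing the dependence on a single capped fetch.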

Attachments

HDFS-14806.001.patch, 30/Aug/19 22:32, 3 kB, Chen Liang
HDFS-14806.002.patch, 04/Sep/19 17:49, 0.9 kB, Erik Krogen
HDFS-14806.003.patch, 05/Sep/19 20:12, 4 kB, Chen Liang
HDFS-14806.004.patch, 04/Nov/19 23:29, 4 kB, Chen Liang

People

Assignee: Chen Liang

Reporter: Chen Liang

Votes: 0

Watchers: 8

Dates

Created: 30/Aug/19 22:31

Updated: 07/Nov/19 00:31

Resolved: 06/Nov/19 17:31