[HDFS-16942] Send error to datanode if FBR is rejected due to bad lease - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.4.0, 3.3.6
Fix Version/s: 3.4.0, 3.2.5, 3.3.6
Component/s: datanode, namenode
Labels: pull-request-available
Target Version/s: 3.4.0, 3.3.6
Hadoop Flags: Reviewed

Description

When a datanode sends a full block report (FBR) to the namenode, it must first obtain a lease. On a couple of busy clusters, we have seen an issue where the DN is somehow delayed in sending the FBR after requesting the lease. The NN then rejects the FBR and logs a message to that effect, but the datanode believes the report was successful and does not try to send another one until the default 6-hour interval has passed.

If this happens to a few DNs, there can be missing and under-replicated blocks, further adding to the cluster load. Even worse, I have seen DNs join the cluster with zero blocks, so it is not obvious that the under-replication is caused by a lost FBR, as all DNs appear to be up and running.

I believe we should propagate an error back to the DN if the FBR is rejected, so that the DN can request a new lease and try again.
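
To make the proposed behaviour concrete, here is a minimal, self-contained sketch of a DN-side retry loop. It is not the actual HDFS code or the change made in the linked pull requests: `FbrLeaseRetrySketch`, `SimulatedNameNode`, `InvalidLeaseException`, `sendFullBlockReport` and the `delayedAttempts` parameter are all hypothetical names, and using an exception to carry the rejection is an assumption made for illustration.

```java
/**
 * Illustrative sketch only: every class and method name here is hypothetical
 * and does not correspond to the real HDFS NameNode/DataNode code.
 */
class FbrLeaseRetrySketch {

    /** Signals that the simulated NN rejected a report because of a bad lease. */
    static class InvalidLeaseException extends Exception {
        InvalidLeaseException(String message) {
            super(message);
        }
    }

    /** Minimal stand-in for the NN side of the lease check. */
    static class SimulatedNameNode {
        private long nextLeaseId = 1;
        private long validLeaseId = 0; // 0 means "no lease outstanding"
        int acceptedReports = 0;

        /** Hands out a fresh lease and remembers it as the only valid one. */
        long requestLease() {
            validLeaseId = nextLeaseId++;
            return validLeaseId;
        }

        /** Simulates the lease timing out before the report arrives. */
        void expireLease() {
            validLeaseId = 0;
        }

        /** Accepts the report only if it carries the currently valid lease. */
        void processReport(long leaseId) throws InvalidLeaseException {
            if (leaseId == 0 || leaseId != validLeaseId) {
                // Key point: the rejection is returned to the caller,
                // not only written to the NN log.
                throw new InvalidLeaseException(
                        "FBR rejected: lease " + leaseId + " is not valid");
            }
            validLeaseId = 0; // in this sketch a lease covers a single report
            acceptedReports++;
        }
    }

    /**
     * DN-side loop: request a lease, send the report, and retry with a new
     * lease if the NN rejects it. The first delayedAttempts attempts are made
     * to arrive late, which stands in for a busy datanode.
     *
     * @return the attempt number that succeeded, or -1 if none did
     */
    static int sendFullBlockReport(SimulatedNameNode nn, int maxAttempts,
                                   int delayedAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            long leaseId = nn.requestLease();
            if (attempt <= delayedAttempts) {
                nn.expireLease(); // the DN was too slow; the lease has lapsed
            }
            try {
                nn.processReport(leaseId);
                return attempt;
            } catch (InvalidLeaseException e) {
                // The DN now knows the report was lost and loops around to
                // request a new lease instead of waiting for the next interval.
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        SimulatedNameNode nn = new SimulatedNameNode();
        int attempts = sendFullBlockReport(nn, 3, 1);
        System.out.println("FBR accepted on attempt " + attempts);
    }
}
```

The sketch caps the number of attempts so that a persistently slow DN does not loop forever. A real implementation would presumably also need some backoff between attempts, which is left out here.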

Issue Links

links to
- GitHub Pull Request #5460
- GitHub Pull Request #5478

People

Assignee: Stephen O'Donnell
Reporter: Stephen O'Donnell
Votes: 0
Watchers: 4

Dates

Created: 07/Mar/23 14:25
Updated: 28/Jan/24 07:14
Resolved: 11/Mar/23 17:45