[HDDS-4539] Container Health Task should not run until Recon has reached steady state

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.4.0
Component/s: Ozone Recon
Labels:
- pull-request-available

Description

On a cluster with millions of containers or hundreds of Datanodes, Recon takes some time to reach a steady state, in which all active Datanodes and containers have reported. If the container health task runs before then, it can incorrectly flag most containers as missing. This was observed on a cluster where Recon was slow to reach steady state because of ~~HDDS-4403~~; it also causes the UI problem described in ~~HDDS-4402~~.

We need to make sure the container health task does not run before the cluster has reached steady state. This could be done with a fixed wait time (~10 minutes) or by checking Recon's SCM state.
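
The two options above can be sketched roughly as follows. This is only an illustration of the idea, not the change made in the linked pull request: the class `SteadyStateGate`, its methods, and the reported/expected Datanode counts are hypothetical names introduced here, and how Recon would actually obtain those counts from its SCM state is left open.

```java
import java.time.Duration;
import java.time.Instant;

/**
 * Hypothetical gate deciding whether the container health task may run.
 * Not part of the Ozone Recon API; all names here are assumptions.
 */
final class SteadyStateGate {
  private final Instant startedAt;  // when Recon started
  private final Duration warmup;    // fixed wait, e.g. ~10 minutes

  SteadyStateGate(Instant startedAt, Duration warmup) {
    this.startedAt = startedAt;
    this.warmup = warmup;
  }

  /** Option 1: a fixed wait time has passed since startup. */
  boolean warmupElapsed(Instant now) {
    return !now.isBefore(startedAt.plus(warmup));
  }

  /** Option 2: a state check; every expected Datanode has reported. */
  static boolean allReported(int reportedDatanodes, int expectedDatanodes) {
    return expectedDatanodes > 0 && reportedDatanodes >= expectedDatanodes;
  }

  /** The task may run once either condition holds. */
  boolean shouldRunHealthTask(Instant now, int reportedDatanodes,
                              int expectedDatanodes) {
    return warmupElapsed(now)
        || allReported(reportedDatanodes, expectedDatanodes);
  }
}
```

Combining the two checks with OR is one possible design choice: the state check lets the task start early on a small cluster, while the fixed wait acts as a fallback so that a Datanode that never reports cannot block the task indefinitely.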

Issue Links

links to

GitHub Pull Request #4049

People

Assignee: Devesh Kumar Singh

Reporter: Aravindan Vijayan

Votes: 0

Watchers: 5

Dates

Created: 02/Dec/20 19:32

Updated: 09/Jan/23 19:25

Resolved: 09/Jan/23 19:24