[YUNIKORN-1179] Logs are spammed with health check status messages - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.0.0
Component/s: core - scheduler
Labels:
- pull-request-available

Target Version: 1.0.0

Description

~~YUNIKORN-1107~~ introduced a periodic background health check.

The problem is, too much noise is printed to the console:

2022-04-20T13:28:03.101Z	INFO	scheduler/health_checker.go:87	Scheduler is healthy	{"health check values": [{"Name":"Scheduling errors","Succeeded":true,"Description":"Check for scheduling error entries in metrics","DiagnosisMessage":"There were 0 scheduling errors logged in the metrics"},{"Name":"Failed nodes","Succeeded":true,"Description":"Check for failed nodes entries in metrics","DiagnosisMessage":"There were 0 failed nodes logged in the metrics"},{"Name":"Negative resources","Succeeded":true,"Description":"Check for negative resources in the partitions","DiagnosisMessage":"Partitions with negative resources: []"},{"Name":"Negative resources","Succeeded":true,"Description":"Check for negative resources in the nodes","DiagnosisMessage":"Nodes with negative resources: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if a node's allocated resource <= total resource of the node","DiagnosisMessage":"Nodes with inconsistent data: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if total partition resource == sum of the node resources from the partition","DiagnosisMessage":"Partitions with inconsistent data: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if node total resource = allocated resource + occupied resource + available resource","DiagnosisMessage":"Nodes with inconsistent data: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if node capacity >= allocated resources on the node","DiagnosisMessage":"Nodes with inconsistent data: []"},{"Name":"Reservation check","Succeeded":true,"Description":"Check the reservation nr compared to the number of nodes","DiagnosisMessage":"Reservation/node nr ratio: [0.000000]"},{"Name":"Orphan allocation on node check","Succeeded":true,"Description":"Check if there are orphan allocations on the nodes","DiagnosisMessage":"Orphan allocations: []"},{"Name":"Orphan allocation on app check","Succeeded":true,"Description":"Check if there 
are orphan allocations on the applications","DiagnosisMessage":"OrphanAllocations: []"}]}
2022-04-20T13:28:33.098Z	INFO	scheduler/health_checker.go:87	Scheduler is healthy	{"health check values": [{"Name":"Scheduling errors","Succeeded":true,"Description":"Check for scheduling error entries in metrics","DiagnosisMessage":"There were 0 scheduling errors logged in the metrics"},{"Name":"Failed nodes","Succeeded":true,"Description":"Check for failed nodes entries in metrics","DiagnosisMessage":"There were 0 failed nodes logged in the metrics"},{"Name":"Negative resources","Succeeded":true,"Description":"Check for negative resources in the partitions","DiagnosisMessage":"Partitions with negative resources: []"},{"Name":"Negative resources","Succeeded":true,"Description":"Check for negative resources in the nodes","DiagnosisMessage":"Nodes with negative resources: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if a node's allocated resource <= total resource of the node","DiagnosisMessage":"Nodes with inconsistent data: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if total partition resource == sum of the node resources from the partition","DiagnosisMessage":"Partitions with inconsistent data: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if node total resource = allocated resource + occupied resource + available resource","DiagnosisMessage":"Nodes with inconsistent data: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if node capacity >= allocated resources on the node","DiagnosisMessage":"Nodes with inconsistent data: []"},{"Name":"Reservation check","Succeeded":true,"Description":"Check the reservation nr compared to the number of nodes","DiagnosisMessage":"Reservation/node nr ratio: [0.000000]"},{"Name":"Orphan allocation on node check","Succeeded":true,"Description":"Check if there are orphan allocations on the nodes","DiagnosisMessage":"Orphan allocations: []"},{"Name":"Orphan allocation on app check","Succeeded":true,"Description":"Check if there 
are orphan allocations on the applications","DiagnosisMessage":"OrphanAllocations: []"}]}

I don't think we need that much output every 30 seconds. In fact, if the scheduler is healthy, we don't need anything at all; maybe a short printout at DEBUG level, but nothing more.

If the health check fails, then we might log it, but even in that case this much output looks unnecessary.
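To illustrate the proposal, here is a minimal Go sketch of the logging behaviour described above: one short DEBUG line when everything passes, and a warning that carries only the failing checks otherwise. It is not the change made in the linked pull request. It uses the standard library `log/slog` package instead of the project's own logger so that it stays self-contained, and the `HealthCheckInfo` struct, `failedChecks` and `logHealth` are hypothetical names that only mirror the fields visible in the log output.

```go
package main

import (
	"log/slog"
	"os"
)

// HealthCheckInfo is a hypothetical type that mirrors the fields
// visible in the "health check values" log entry.
type HealthCheckInfo struct {
	Name             string
	Succeeded        bool
	Description      string
	DiagnosisMessage string
}

// failedChecks returns only the checks that did not succeed.
func failedChecks(checks []HealthCheckInfo) []HealthCheckInfo {
	var failed []HealthCheckInfo
	for _, c := range checks {
		if !c.Succeeded {
			failed = append(failed, c)
		}
	}
	return failed
}

// logHealth writes a short DEBUG line when all checks pass and a WARN
// line listing only the failing checks otherwise. It returns the level
// it logged at so callers can inspect the decision.
func logHealth(logger *slog.Logger, checks []HealthCheckInfo) slog.Level {
	failed := failedChecks(checks)
	if len(failed) == 0 {
		logger.Debug("Scheduler is healthy", "checks", len(checks))
		return slog.LevelDebug
	}
	logger.Warn("Scheduler is not healthy", "failed checks", failed)
	return slog.LevelWarn
}

func main() {
	// At INFO level the healthy case prints nothing at all.
	logger := slog.New(slog.NewTextHandler(os.Stderr,
		&slog.HandlerOptions{Level: slog.LevelInfo}))

	healthy := []HealthCheckInfo{
		{Name: "Scheduling errors", Succeeded: true},
		{Name: "Failed nodes", Succeeded: true},
	}
	logHealth(logger, healthy) // suppressed: DEBUG is below INFO

	unhealthy := append(healthy, HealthCheckInfo{
		Name:             "Negative resources",
		Succeeded:        false,
		DiagnosisMessage: "example failure for illustration",
	})
	logHealth(logger, unhealthy) // one WARN line with the single failing check
}
```

With the handler at INFO level, a healthy scheduler produces no output from this function, which matches the suggestion that the healthy case needs at most a DEBUG message.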

Issue Links

is caused by: YUNIKORN-1107 Make health check occur in the background (Closed)

links to: GitHub Pull Request #406
