Apache YuniKorn / YUNIKORN-1179

Logs are spammed with health check status messages


Details

    Description

      YUNIKORN-1107 introduced a periodic background health check.

      The problem is that it prints too much noise to the console:

      2022-04-20T13:28:03.101Z	INFO	scheduler/health_checker.go:87	Scheduler is healthy	{"health check values": [{"Name":"Scheduling errors","Succeeded":true,"Description":"Check for scheduling error entries in metrics","DiagnosisMessage":"There were 0 scheduling errors logged in the metrics"},{"Name":"Failed nodes","Succeeded":true,"Description":"Check for failed nodes entries in metrics","DiagnosisMessage":"There were 0 failed nodes logged in the metrics"},{"Name":"Negative resources","Succeeded":true,"Description":"Check for negative resources in the partitions","DiagnosisMessage":"Partitions with negative resources: []"},{"Name":"Negative resources","Succeeded":true,"Description":"Check for negative resources in the nodes","DiagnosisMessage":"Nodes with negative resources: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if a node's allocated resource <= total resource of the node","DiagnosisMessage":"Nodes with inconsistent data: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if total partition resource == sum of the node resources from the partition","DiagnosisMessage":"Partitions with inconsistent data: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if node total resource = allocated resource + occupied resource + available resource","DiagnosisMessage":"Nodes with inconsistent data: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if node capacity >= allocated resources on the node","DiagnosisMessage":"Nodes with inconsistent data: []"},{"Name":"Reservation check","Succeeded":true,"Description":"Check the reservation nr compared to the number of nodes","DiagnosisMessage":"Reservation/node nr ratio: [0.000000]"},{"Name":"Orphan allocation on node check","Succeeded":true,"Description":"Check if there are orphan allocations on the nodes","DiagnosisMessage":"Orphan allocations: []"},{"Name":"Orphan allocation on app check","Succeeded":true,"Description":"Check if there are orphan allocations on the applications","DiagnosisMessage":"OrphanAllocations: []"}]}
      2022-04-20T13:28:33.098Z	INFO	scheduler/health_checker.go:87	Scheduler is healthy	{"health check values": [{"Name":"Scheduling errors","Succeeded":true,"Description":"Check for scheduling error entries in metrics","DiagnosisMessage":"There were 0 scheduling errors logged in the metrics"},{"Name":"Failed nodes","Succeeded":true,"Description":"Check for failed nodes entries in metrics","DiagnosisMessage":"There were 0 failed nodes logged in the metrics"},{"Name":"Negative resources","Succeeded":true,"Description":"Check for negative resources in the partitions","DiagnosisMessage":"Partitions with negative resources: []"},{"Name":"Negative resources","Succeeded":true,"Description":"Check for negative resources in the nodes","DiagnosisMessage":"Nodes with negative resources: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if a node's allocated resource <= total resource of the node","DiagnosisMessage":"Nodes with inconsistent data: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if total partition resource == sum of the node resources from the partition","DiagnosisMessage":"Partitions with inconsistent data: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if node total resource = allocated resource + occupied resource + available resource","DiagnosisMessage":"Nodes with inconsistent data: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if node capacity >= allocated resources on the node","DiagnosisMessage":"Nodes with inconsistent data: []"},{"Name":"Reservation check","Succeeded":true,"Description":"Check the reservation nr compared to the number of nodes","DiagnosisMessage":"Reservation/node nr ratio: [0.000000]"},{"Name":"Orphan allocation on node check","Succeeded":true,"Description":"Check if there are orphan allocations on the nodes","DiagnosisMessage":"Orphan allocations: []"},{"Name":"Orphan allocation on app check","Succeeded":true,"Description":"Check if there are orphan allocations on the applications","DiagnosisMessage":"OrphanAllocations: []"}]}
      

      I don't think we need that much output every 30 seconds. In fact, if the scheduler is healthy, we don't need any output at all, or at most a short printout at DEBUG level.

      If the health check fails, we might log it, but even then this much detail looks unnecessary.
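
The proposal above could be sketched roughly as follows. This is a minimal illustration, not the actual YuniKorn code: the struct fields are taken from the JSON in the log output, while `summarize`, `logLevel`, and the level constants are hypothetical names standing in for whatever the real logger provides. A healthy result yields only a short DEBUG message, and an unhealthy one lists just the checks that failed.

```go
package main

import (
	"fmt"
	"strings"
)

// HealthCheckInfo mirrors the fields visible in the logged JSON.
type HealthCheckInfo struct {
	Name             string
	Succeeded        bool
	Description      string
	DiagnosisMessage string
}

// logLevel is a hypothetical stand-in for the real logger's levels.
type logLevel string

const (
	levelDebug logLevel = "DEBUG"
	levelWarn  logLevel = "WARN"
)

// summarize decides how a health check result should be logged:
// a short DEBUG message when every check passed, otherwise a WARN
// message that contains only the failed checks.
func summarize(checks []HealthCheckInfo) (logLevel, string) {
	var failed []string
	for _, c := range checks {
		if !c.Succeeded {
			failed = append(failed, fmt.Sprintf("%s: %s", c.Name, c.DiagnosisMessage))
		}
	}
	if len(failed) == 0 {
		return levelDebug, "Scheduler is healthy"
	}
	return levelWarn, "Scheduler is not healthy: " + strings.Join(failed, "; ")
}

func main() {
	// Example input: one passing and one failing check (made-up values).
	checks := []HealthCheckInfo{
		{Name: "Scheduling errors", Succeeded: true,
			DiagnosisMessage: "There were 0 scheduling errors logged in the metrics"},
		{Name: "Failed nodes", Succeeded: false,
			DiagnosisMessage: "There were 2 failed nodes logged in the metrics"},
	}
	level, msg := summarize(checks)
	fmt.Printf("%s\t%s\n", level, msg)
}
```

With a production log level of INFO, the healthy case would then produce no output at all, and the full per-check detail would remain available through other means such as a REST endpoint, if one exists.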


            People

              lowc1012 Ryan Lo
              pbacsko Peter Bacsko
