Apache YuniKorn / YUNIKORN-1615

Node occupied resource is negative


Details

    Description

      After some tasks completed, the YuniKorn scheduler reported negative used resources on the nodes, which threw scheduling into chaos. I tried restarting the scheduler, but it eventually reports negative resources again once some tasks have completed. I found the following entries in the YuniKorn scheduler log:

      2023-03-01T18:10:40.038Z        INFO    cache/nodes.go:140      report occupied resources updates       {"node": "172.18.45.234", "request": {"nodes":[{"nodeID":"172.18.45.234","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":520021631754},"hugepages-1Gi":{},"hugepages-2Mi":{},"memory":{"value":201131376640},"pods":{"value":110},"vcore":{"value":40000}}},"occupiedResource":{"resources":{"memory":{"value":-10126244160},"vcore":{"value":-9700}}}}],"rmID":"k8s_dios"}}
      2023-03-01T18:10:44.635Z        INFO    cache/nodes.go:140      report occupied resources updates       {"node": "172.18.45.228", "request": {"nodes":[{"nodeID":"172.18.45.228","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":249175645796},"hugepages-1Gi":{},"hugepages-2Mi":{},"memory":{"value":269682475008},"pods":{"value":110},"vcore":{"value":40000}}},"occupiedResource":{"resources":{"memory":{"value":-10314987840},"vcore":{"value":-9400}}}}],"rmID":"k8s_dios"}}
      2023-03-01T18:10:44.870Z        INFO    cache/nodes.go:140      report occupied resources updates       {"node": "172.18.45.230", "request": {"nodes":[{"nodeID":"172.18.45.230","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":249175645796},"hugepages-1Gi":{},"hugepages-2Mi":{},"memory":{"value":269682475008},"pods":{"value":110},"vcore":{"value":40000}}},"occupiedResource":{"resources":{"memory":{"value":-8829204224},"vcore":{"value":-8500}}}}],"rmID":"k8s_dios"}}
      2023-03-01T18:10:49.279Z        INFO    cache/nodes.go:140      report occupied resources updates       {"node": "172.18.45.235", "request": {"nodes":[{"nodeID":"172.18.45.235","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":520021631754},"hugepages-1Gi":{},"hugepages-2Mi":{},"memory":{"value":201131372544},"pods":{"value":110},"vcore":{"value":40000}}},"occupiedResource":{"resources":{"memory":{"value":-8504048512},"vcore":{"value":-7800}}}}],"rmID":"k8s_dios"}}
      2023-03-01T18:15:42.686Z        INFO    cache/nodes.go:140      report occupied resources updates       {"node": "172.18.45.230", "request": {"nodes":[{"nodeID":"172.18.45.230","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":249175645796},"hugepages-1Gi":{},"hugepages-2Mi":{},"memory":{"value":269682475008},"pods":{"value":110},"vcore":{"value":40000}}},"occupiedResource":{"resources":{"memory":{"value":-9902946048},"vcore":{"value":-9500}}}}],"rmID":"k8s_dios"}}
      2023-03-01T18:15:43.857Z        INFO    cache/nodes.go:140      report occupied resources updates       {"node": "172.18.45.234", "request": {"nodes":[{"nodeID":"172.18.45.234","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":520021631754},"hugepages-1Gi":{},"hugepages-2Mi":{},"memory":{"value":201131376640},"pods":{"value":110},"vcore":{"value":40000}}},"occupiedResource":{"resources":{"memory":{"value":-11199985984},"vcore":{"value":-10700}}}}],"rmID":"k8s_dios"}}
      2023-03-01T18:15:49.229Z        INFO    cache/nodes.go:140      report occupied resources updates       {"node": "172.18.45.235", "request": {"nodes":[{"nodeID":"172.18.45.235","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":520021631754},"hugepages-1Gi":{},"hugepages-2Mi":{},"memory":{"value":201131372544},"pods":{"value":110},"vcore":{"value":40000}}},"occupiedResource":{"resources":{"memory":{"value":-9577790336},"vcore":{"value":-8800}}}}],"rmID":"k8s_dios"}}
      2023-03-01T18:15:54.457Z        INFO    cache/nodes.go:140      report occupied resources updates       {"node": "172.18.45.228", "request": {"nodes":[{"nodeID":"172.18.45.228","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":249175645796},"hugepages-1Gi":{},"hugepages-2Mi":{},"memory":{"value":269682475008},"pods":{"value":110},"vcore":{"value":40000}}},"occupiedResource":{"resources":{"memory":{"value":-11388729664},"vcore":{"value":-10400}}}}],"rmID":"k8s_dios"}}
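      The entries above can be checked mechanically. The following is a minimal Python sketch, not part of YuniKorn: the helper names `occupied_by_node` and `negative_only` are hypothetical, and it assumes the JSON payload starts at the first `{` of each log line. The two sample lines are the 18:10 and 18:15 entries for node 172.18.45.234, copied from the log above.

```python
import json

def occupied_by_node(log_line):
    """Map nodeID -> {resource name: occupied value} for one log line."""
    start = log_line.find("{")  # assumption: the JSON payload follows the message text
    if start < 0:
        return {}
    payload = json.loads(log_line[start:])
    result = {}
    for node in payload.get("request", {}).get("nodes", []):
        resources = node.get("occupiedResource", {}).get("resources", {})
        result[node["nodeID"]] = {name: q.get("value", 0) for name, q in resources.items()}
    return result

def negative_only(occupied):
    """Keep only nodes that have at least one negative value, and only those values."""
    return {
        node: {name: value for name, value in res.items() if value < 0}
        for node, res in occupied.items()
        if any(value < 0 for value in res.values())
    }

# Two entries for the same node, taken verbatim from the log in this issue.
LINE_1810 = '2023-03-01T18:10:40.038Z        INFO    cache/nodes.go:140      report occupied resources updates       {"node": "172.18.45.234", "request": {"nodes":[{"nodeID":"172.18.45.234","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":520021631754},"hugepages-1Gi":{},"hugepages-2Mi":{},"memory":{"value":201131376640},"pods":{"value":110},"vcore":{"value":40000}}},"occupiedResource":{"resources":{"memory":{"value":-10126244160},"vcore":{"value":-9700}}}}],"rmID":"k8s_dios"}}'
LINE_1815 = '2023-03-01T18:15:43.857Z        INFO    cache/nodes.go:140      report occupied resources updates       {"node": "172.18.45.234", "request": {"nodes":[{"nodeID":"172.18.45.234","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":520021631754},"hugepages-1Gi":{},"hugepages-2Mi":{},"memory":{"value":201131376640},"pods":{"value":110},"vcore":{"value":40000}}},"occupiedResource":{"resources":{"memory":{"value":-11199985984},"vcore":{"value":-10700}}}}],"rmID":"k8s_dios"}}'

first = occupied_by_node(LINE_1810)
second = occupied_by_node(LINE_1815)

print(negative_only(first))
# {'172.18.45.234': {'memory': -10126244160, 'vcore': -9700}}

node = "172.18.45.234"
delta = {name: second[node][name] - first[node][name] for name in first[node]}
print(delta)
# {'memory': -1073741824, 'vcore': -1000}
```

      For this pair of entries, the occupied values fall by 1073741824 bytes of memory (1 GiB) and 1000 vcore between the two reports, so the value keeps decreasing rather than staying at a fixed negative offset.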

      Yunikorn UI

       

      Health Check Result & Log

       

      2023-03-02T03:25:52.310Z        WARN    scheduler/health_checker.go:176 Scheduler is not healthy        {"health check values": [{"Name":"Scheduling errors","Succeeded":true,"Description":"Check for scheduling error entries in metrics","DiagnosisMessage":"There were 0 scheduling errors logged in the metrics"},{"Name":"Failed nodes","Succeeded":true,"Description":"Check for failed nodes entries in metrics","DiagnosisMessage":"There were 0 failed nodes logged in the metrics"},{"Name":"Negative resources","Succeeded":true,"Description":"Check for negative resources in the partitions","DiagnosisMessage":"Partitions with negative resources: []"},{"Name":"Negative resources","Succeeded":false,"Description":"Check for negative resources in the nodes","DiagnosisMessage":"Nodes with negative resources: [\"172.18.45.228\" \"172.18.45.235\" \"172.18.45.234\" \"172.18.45.230\"]"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if a node's allocated resource <= total resource of the node","DiagnosisMessage":"Nodes with inconsistent data: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if total partition resource == sum of the node resources from the partition","DiagnosisMessage":"Partitions with inconsistent data: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if node total resource = allocated resource + occupied resource + available resource","DiagnosisMessage":"Nodes with inconsistent data: []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if node capacity >= allocated resources on the node","DiagnosisMessage":"Nodes with inconsistent data: []"},{"Name":"Reservation check","Succeeded":true,"Description":"Check the reservation nr compared to the number of nodes","DiagnosisMessage":"Reservation/node nr ratio: [0.000000]"},{"Name":"Orphan allocation on node check","Succeeded":true,"Description":"Check if there are orphan allocations on the nodes","DiagnosisMessage":"Orphan allocations: []"},{"Name":"Orphan allocation on app check","Succeeded":true,"Description":"Check if there are orphan allocations on the applications","DiagnosisMessage":"OrphanAllocations: []"}]}
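      Only one entry in the health check output has "Succeeded":false, and it is easy to miss in a line this long. A minimal Python sketch for pulling out the failing checks is shown here; the helper name `failed_checks` is hypothetical, and the sample is trimmed to two of the entries from the log line above (the two "Negative resources" checks), otherwise verbatim.

```python
import json

def failed_checks(log_line):
    """Return (Name, DiagnosisMessage) for every health check that did not succeed."""
    start = log_line.find("{")  # assumption: the JSON payload follows the message text
    payload = json.loads(log_line[start:])
    return [
        (check["Name"], check["DiagnosisMessage"])
        for check in payload["health check values"]
        if not check["Succeeded"]
    ]

# Raw string so the escaped quotes inside the JSON survive unchanged.
# Trimmed to two entries of the original line for brevity.
HEALTH_LINE = r'2023-03-02T03:25:52.310Z        WARN    scheduler/health_checker.go:176 Scheduler is not healthy        {"health check values": [{"Name":"Negative resources","Succeeded":true,"Description":"Check for negative resources in the partitions","DiagnosisMessage":"Partitions with negative resources: []"},{"Name":"Negative resources","Succeeded":false,"Description":"Check for negative resources in the nodes","DiagnosisMessage":"Nodes with negative resources: [\"172.18.45.228\" \"172.18.45.235\" \"172.18.45.234\" \"172.18.45.230\"]"}]}'

failures = failed_checks(HEALTH_LINE)
for name, message in failures:
    print(f"{name}: {message}")
```

      In the full log line, the node-level "Negative resources" check is the only one that fails; the partition-level check and all "Consistency of data" checks report success.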

       

       

      Kubernetes version
      Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.8", GitCommit:"5575935422cc1cf5169dfc8847cb587aa47bac5a", GitTreeState:"clean", BuildDate:"2021-06-16T12:53:07Z", GoVersion:"go1.15.13", Compiler:"gc", Platform:"linux/amd64"}

          People

            ccondit Craig Condit
            kej1 Jie Ke
            Votes: 2
            Watchers: 12
