
Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.5.1, 1.6.0
    • Component/s: None
    • Labels: None
    • Environment:
      Yunikorn: 1.5
      AWS EKS: v1.28.6-eks-508b6b3

    Description

      Discussion on Yunikorn slack: https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1711048995187179

      Occasionally, Yunikorn deadlocks and prevents any new pods from starting; they all stay in Pending. The Yunikorn scheduler logs contain no errors that indicate a problem.

      Additionally, the pods all have the correct annotations / labels from the admission service, so they are at least being created in k8s correctly.

      The issue was seen intermittently on Yunikorn version 1.5 in EKS, using Kubernetes version `v1.28.6-eks-508b6b3`.

      In our case, we run about 25-50 nodes and 200-400 pods. Pods and nodes are added and removed frequently because we run ML workloads.

      Attached is the goroutine dump. We were not able to get a statedump as the endpoint kept timing out. 
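
      For anyone triaging a similar hang, a goroutine dump can be summarized by wait state to see how many goroutines are parked on mutexes. The sketch below is not part of Yunikorn: `SummarizeDump`, `LockWaiters`, and the sample dump are hypothetical names and made-up data. It assumes the text format of Go's full goroutine dump (`debug=2`), and that the Go version reports lock waits with states such as `sync.Mutex.Lock` (Go 1.20 and later; older versions may report `semacquire` instead, which this sketch does not count).

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// headerRe matches goroutine header lines, for example:
//
//	goroutine 42 [sync.Mutex.Lock, 5 minutes]:
//
// Group 1 is the wait state, i.e. the text up to the first comma.
var headerRe = regexp.MustCompile(`^goroutine \d+ \[([^\],]+)(?:, [^\]]*)?\]:$`)

// sampleDump is made-up input for illustration, not taken from the attachments.
const sampleDump = `goroutine 1 [running]:
main.main()

goroutine 42 [sync.Mutex.Lock, 5 minutes]:
sync.(*Mutex).Lock(...)

goroutine 43 [sync.RWMutex.RLock, 5 minutes]:
sync.(*RWMutex).RLock(...)

goroutine 44 [select, 3 minutes, locked to thread]:
main.worker()
`

// SummarizeDump counts goroutines per wait state in a text goroutine dump.
func SummarizeDump(dump string) map[string]int {
	counts := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(dump))
	// Allow long stack lines (up to 1 MiB) instead of the 64 KiB default.
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if m := headerRe.FindStringSubmatch(line); m != nil {
			counts[m[1]]++
		}
	}
	return counts
}

// LockWaiters sums the goroutines whose wait state names a mutex,
// e.g. sync.Mutex.Lock, sync.RWMutex.Lock, or sync.RWMutex.RLock.
func LockWaiters(counts map[string]int) int {
	n := 0
	for state, c := range counts {
		if strings.Contains(state, "Mutex") {
			n += c
		}
	}
	return n
}

func main() {
	counts := SummarizeDump(sampleDump)
	states := make([]string, 0, len(counts))
	for s := range counts {
		states = append(states, s)
	}
	sort.Strings(states) // stable output order
	for _, s := range states {
		fmt.Printf("%-24s %d\n", s, counts[s])
	}
	fmt.Println("goroutines waiting on a mutex:", LockWaiters(counts))
}
```

      A mutex-waiter count that stays high across successive dumps is a hint of a deadlock, not proof; the stacks of those goroutines still need to be read to find which locks are involved.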

      The workaround is to restart the Yunikorn scheduler pod. Sometimes you also have to delete any "Pending" pods that got stuck while the scheduler was deadlocked, so that they get picked up by the new scheduler pod.

      Attachments

        1. deadlock_2024-04-18.log
          11 kB
          Xi Chen
        2. logs-splunk-ordered.txt
          15.92 MB
          Shravan Achar
        3. logs-splunk.txt
          47.77 MB
          Shravan Achar
        4. running-logs-2.txt
          6.25 MB
          Shravan Achar
        5. profile001-4-5.gif
          192 kB
          Shravan Achar
        6. goroutine-4-5.out
          991 kB
          Shravan Achar
        7. running-logs.txt
          48.88 MB
          Shravan Achar
        8. logs-potential-deadlock.txt
          14 kB
          Shravan Achar
        9. logs-potential-deadlock-2.txt
          14 kB
          Shravan Achar
        10. 0001-YUNIKORN-2539-core.patch
          43 kB
          Craig Condit
        11. 0002-YUNIKORN-2539-k8shim.patch
          31 kB
          Craig Condit
        12. 4_4_profile003.png
          449 kB
          Noah Yoshida
        13. 4_4_profile001.png
          407 kB
          Noah Yoshida
        14. 4_4_profile002.png
          508 kB
          Noah Yoshida
        15. 4_4_scheduler-logs.txt
          180 kB
          Noah Yoshida
        16. 4_4_goroutine-4.txt
          86 kB
          Noah Yoshida
        17. 4_4_goroutine-3.txt
          86 kB
          Noah Yoshida
        18. 4_4_goroutine-5-state-dump.txt
          88 kB
          Noah Yoshida
        19. 4_4_goroutine-2.txt
          85 kB
          Noah Yoshida
        20. 4_4_goroutine-1.txt
          85 kB
          Noah Yoshida
        21. goroutine-while-blocking.out
          1.53 MB
          Shravan Achar
        22. goroutine-while-blocking-2.out
          8 kB
          Shravan Achar
        23. goroutine-4-3-3.out
          1.39 MB
          Shravan Achar
        24. goroutine-4-3.out
          1.16 MB
          Shravan Achar
        25. profile013.gif
          184 kB
          Shravan Achar
        26. profile012.gif
          184 kB
          Shravan Achar
        27. goroutine-4-3-2.out
          8 kB
          Shravan Achar
        28. goroutine-4-3-1.out
          8 kB
          Shravan Achar
        29. goroutine-dump.txt
          32 kB
          Noah Yoshida

            People

              Assignee: pbacsko Peter Bacsko
              Reporter: nyoshida Noah Yoshida