  1. Apache YuniKorn
  2. YUNIKORN-549

Scheduler recovery occasionally fails while recovering a large number of applications


Details

    Description

      The current recovery logic re-adds applications based on the pods reported by the informers/listers. Under some conditions, the recovery of an app can fail if the app has both Running and Pending pods. This happens because the shim marks an app for recovery only if the informer notified the scheduler before the lister function was called. That ordering is not guaranteed, so the check does not work consistently. We need a stable implementation to tell whether an app needs recovery or not.

            People

              wwei Weiwei Yang
