Apache YuniKorn / YUNIKORN-1670

Application recovery can fail if app is rejected


Details

    Description

      During application recovery, the current code waits up to 30 seconds for all applications to transition to "Accepted". However, if an application is rejected, or if the cluster is large enough that the transitions take longer than that, recovery will not succeed.

      Similar to how informer sync was recently updated, we should modify the logic to keep trying, but log periodically. Additionally, we should not look specifically for the Accepted state, but for any state other than New or Recovering (state != New and state != Recovering). This ensures that all applications have been processed.


            People

              ccondit Craig Condit
