Apache YuniKorn / YUNIKORN-1597

Gang scheduling: application might not transition to Running after recovery

Details

    Description

      Pods get stuck in a recovery scenario that involves gang scheduling.

      High-level overview:
      1. All placeholders are running and allocated.
      2. The real pod is in the Pending state.
      3. YuniKorn crashes and recovers.

      In this case, the real pod never transitions to Running, because:
      1. Upon recovery, the state of recovered tasks is set to "Allocated", not "Bound".
      2. If the placeholder tasks are already running and allocated, postTaskBound() is never called.

      A possible fix:
      1. In Task.initialize(), set the state to "Bound" if the task is a placeholder.
      2. In Application.onReserving(), check whether the application has placeholders. If it does, the application is being handled after recovery, so send an "UpdateReservation" event.
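
      The cause and the proposed fix can be illustrated with a toy model in Go. This is only a sketch: every type and function below is a hypothetical stand-in rather than the shim's real code, and the rule that the application leaves Reserving once enough placeholders are "Bound" is an assumption made for the illustration. The applyFix flag toggles both steps of the proposed fix.

```go
package main

// Toy model of the recovery scenario. All names are hypothetical
// stand-ins for illustration, not the YuniKorn shim's real code.

type TaskState int

const (
	TaskAllocated TaskState = iota // state the issue reports for recovered tasks
	TaskBound                      // state counted by the reservation check
)

type AppState int

const (
	AppReserving AppState = iota
	AppRunning
)

type Task struct {
	Placeholder bool
	State       TaskState
}

type Application struct {
	State      AppState
	Tasks      []*Task
	MinMembers int // assumed: placeholders that must be Bound before the app runs
}

// initializeRecovered models Task.initialize() for a pod that was
// already running when the scheduler restarted.
func (t *Task) initializeRecovered(applyFix bool) {
	if applyFix && t.Placeholder {
		t.State = TaskBound // proposed fix, step 1
		return
	}
	t.State = TaskAllocated // behaviour described in the issue
}

// updateReservation models handling of an "UpdateReservation" event.
// Assumed rule: the app leaves Reserving once enough placeholders are Bound.
func (a *Application) updateReservation() {
	bound := 0
	for _, t := range a.Tasks {
		if t.Placeholder && t.State == TaskBound {
			bound++
		}
	}
	if a.State == AppReserving && bound >= a.MinMembers {
		a.State = AppRunning
	}
}

// onReserving models Application.onReserving() after recovery.
func (a *Application) onReserving(applyFix bool) {
	a.State = AppReserving
	if !applyFix {
		// No task transitions to Bound any more, so nothing
		// (the postTaskBound() path) ever triggers the check.
		return
	}
	for _, t := range a.Tasks {
		if t.Placeholder {
			a.updateReservation() // proposed fix, step 2
			return
		}
	}
}

// recoverApp replays the scenario with two running placeholders and
// returns the application state after recovery.
func recoverApp(applyFix bool) AppState {
	app := &Application{MinMembers: 2}
	for i := 0; i < 2; i++ {
		t := &Task{Placeholder: true}
		t.initializeRecovered(applyFix)
		app.Tasks = append(app.Tasks, t)
	}
	app.onReserving(applyFix)
	return app.State
}
```

      In this model, recoverApp(false) leaves the application in AppReserving, which matches the stuck behaviour described above, while recoverApp(true) reaches AppRunning.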

          People

            pbacsko Peter Bacsko
            pbacsko Peter Bacsko