Apache YuniKorn / YUNIKORN-657

Expose reason of application failure to pods

Details

    Description

      An application may fail for a number of reasons. For example,

      • In gang scheduling, the placeholders expire before all of them can be allocated
      • When no placement rules are defined (i.e. static queues are used), an application is submitted to a nonexistent queue
      • The total amount of resources requested by a gang-scheduled app exceeds the capacity of the queue

      YK's finite state machine treats Failed as a terminal application state, so YK never tries to revive a failed app. As a result, the pods of a failed app stay stuck in Pending indefinitely. A better behavior would be for YK to mark those pods as failed as well and pass the reason for the failure on to them.
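
      The proposed behavior can be illustrated with a minimal sketch. This is not YuniKorn code: the Application and Pod types, the Fail method, and the reason string "ApplicationFailed" are all hypothetical and only model the idea. The phase, reason, and message fields loosely mirror the fields of the same names in a Kubernetes pod status. When the app reaches its terminal Failed state, every pod still in Pending is marked Failed and carries the failure reason.

```go
package main

import "fmt"

// PodPhase is a simplified stand-in for the Kubernetes pod phase.
type PodPhase string

const (
	PodPending PodPhase = "Pending"
	PodFailed  PodPhase = "Failed"
)

// Pod is a hypothetical, simplified model of a pod and its status.
type Pod struct {
	Name    string
	Phase   PodPhase
	Reason  string // short machine-readable cause
	Message string // human-readable detail
}

// Application is a hypothetical model of a scheduler-side application.
type Application struct {
	ID    string
	State string
	Pods  []*Pod
}

// Fail moves the application to the terminal Failed state and propagates
// the reason to every pod that is still Pending. Pods in any other phase
// are left untouched. It returns the number of pods that were updated.
func (a *Application) Fail(reason, message string) int {
	a.State = "Failed"
	updated := 0
	for _, p := range a.Pods {
		if p.Phase != PodPending {
			continue
		}
		p.Phase = PodFailed
		p.Reason = reason
		p.Message = message
		updated++
	}
	return updated
}

func main() {
	app := &Application{
		ID:    "app-1",
		State: "Accepted",
		Pods: []*Pod{
			{Name: "driver", Phase: PodPending},
			{Name: "executor-0", Phase: PodPending},
		},
	}
	n := app.Fail("ApplicationFailed", "placeholders expired before all were allocated")
	fmt.Printf("app %s is %s, %d pods updated\n", app.ID, app.State, n)
	for _, p := range app.Pods {
		fmt.Printf("%s: %s (%s: %s)\n", p.Name, p.Phase, p.Reason, p.Message)
	}
}
```

      With this shape, a user inspecting a pod would see both that it failed and why, instead of a pod that stays pending with no explanation.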

            People

              yuchaoran2011 Chaoran Yu