  1. Airavata
  2. AIRAVATA-2943

Re-queueing and node failures in HPC clusters need to be handled in gateway middleware as resubmitting failures

Details

    Description

      Currently, in clusters (PBS and SLURM), jobs are getting re-queued due to node failures. In such scenarios the jobs are executed after re-queueing, but on the gateway side the job is marked as FAILED at the initial NODE_FAIL.

      These types of failures need to be captured as retryable failures rather than being treated as the job's final result.
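
      One way to picture the requested behaviour is a state mapping in which NODE_FAIL and a subsequent re-queue are non-terminal. The sketch below is only an illustration under assumptions: the GatewayJobState names, the RETRYING state, and the helper functions are hypothetical and are not taken from Airavata's code; only the scheduler state strings follow SLURM's job state names.

```python
from enum import Enum


class GatewayJobState(Enum):
    # Hypothetical gateway-side states; not Airavata's actual job model.
    QUEUED = "QUEUED"
    ACTIVE = "ACTIVE"
    COMPLETE = "COMPLETE"
    FAILED = "FAILED"
    RETRYING = "RETRYING"


# Scheduler states after which the scheduler may still run the job again.
RETRYABLE_SCHEDULER_STATES = {"NODE_FAIL", "REQUEUED"}

# Direct mapping for the remaining scheduler states handled in this sketch.
SCHEDULER_TO_GATEWAY = {
    "PENDING": GatewayJobState.QUEUED,
    "RUNNING": GatewayJobState.ACTIVE,
    "COMPLETED": GatewayJobState.COMPLETE,
    "FAILED": GatewayJobState.FAILED,
}

TERMINAL_GATEWAY_STATES = {GatewayJobState.COMPLETE, GatewayJobState.FAILED}


def map_scheduler_state(scheduler_state: str) -> GatewayJobState:
    """Map a scheduler-reported state to a (hypothetical) gateway state."""
    state = scheduler_state.strip().upper()
    if state in RETRYABLE_SCHEDULER_STATES:
        # A node failure or re-queue is not an end result: keep monitoring.
        return GatewayJobState.RETRYING
    if state not in SCHEDULER_TO_GATEWAY:
        raise ValueError(f"unhandled scheduler state: {scheduler_state!r}")
    return SCHEDULER_TO_GATEWAY[state]


def is_terminal(state: GatewayJobState) -> bool:
    """Return True only for states after which monitoring should stop."""
    return state in TERMINAL_GATEWAY_STATES


def final_state(scheduler_states):
    """Replay scheduler updates and return the first terminal gateway state.

    Returns None if no terminal state has been reached yet.
    """
    for raw in scheduler_states:
        mapped = map_scheduler_state(raw)
        if is_terminal(mapped):
            return mapped
    return None
```

      With this mapping, a job that reports NODE_FAIL, is then re-queued, runs again, and completes ends up as COMPLETE on the gateway side instead of being frozen as FAILED at the first NODE_FAIL.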

          People

            dimuthuupe Dimuthu
            eroma_a Eroma
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved: