  1. Airavata
  2. AIRAVATA-3872

Computing resource node failure and job re-queue handling


Details

    • Easy

    Description

      This issue occurs from time to time; this time it was seen in the production Ultrascan gateway, https://django.ultrascan.scigap.org/. This gateway is connected to the production stack and the Django portal for admin operations.

      When a node failure happens after a job has been submitted and queued, the failure is reported through an email notification and the job goes to the UNKNOWN state in the gateway. On the remote cluster, the job gets re-queued and completes, and email notifications are sent for these later state changes. Helix treats UNKNOWN as a final job state and does not process emails that arrive after it.
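
      The behaviour described above can be illustrated with a minimal sketch. It is not Airavata or Helix code: the state names, the `apply_email_update` and `replay` helpers, and the two sets of final states are all assumptions made for illustration. It replays the email sequence of a job whose node fails while queued, once with UNKNOWN treated as final (the behaviour reported here) and once with UNKNOWN left open (one possible alternative).

```python
from enum import Enum


class JobState(Enum):
    # Illustrative state names; not taken from the Airavata data model.
    SUBMITTED = "SUBMITTED"
    QUEUED = "QUEUED"
    ACTIVE = "ACTIVE"
    COMPLETE = "COMPLETE"
    FAILED = "FAILED"
    CANCELED = "CANCELED"
    UNKNOWN = "UNKNOWN"


# Behaviour reported in this issue: UNKNOWN is treated as a final state.
FINAL_STATES_CURRENT = {
    JobState.COMPLETE,
    JobState.FAILED,
    JobState.CANCELED,
    JobState.UNKNOWN,
}

# Hypothetical alternative: UNKNOWN stays open so a re-queued job can recover.
FINAL_STATES_RELAXED = FINAL_STATES_CURRENT - {JobState.UNKNOWN}


def apply_email_update(current, reported, final_states):
    """Return the job state after an email notification reports `reported`."""
    if current in final_states:
        # The job is already considered finished, so the email is ignored.
        return current
    return reported


def replay(events, final_states):
    """Apply a sequence of reported states, starting from SUBMITTED."""
    state = JobState.SUBMITTED
    for reported in events:
        state = apply_email_update(state, reported, final_states)
    return state


# Emails for a job whose node fails while queued and which is then re-queued.
events = [
    JobState.QUEUED,
    JobState.UNKNOWN,   # node failure notification
    JobState.QUEUED,    # re-queued on the remote cluster
    JobState.ACTIVE,
    JobState.COMPLETE,
]

print(replay(events, FINAL_STATES_CURRENT).value)  # UNKNOWN
print(replay(events, FINAL_STATES_RELAXED).value)  # COMPLETE
```

      With UNKNOWN in the set of final states, the job stays in UNKNOWN even though the cluster later reports COMPLETE. With UNKNOWN left open, the later emails move the job forward to COMPLETE.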

      Currently, when this happens, an operational task takes care of updating the job status and processing the email notifications sent.


          People

            dimuthuupe Dimuthu
            eroma_a Eroma