[AIRAVATA-3872] Computing resource node failure and job re-queue handling - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: helix implementation
Labels: None
Environment: https://django.ultrascan.scigap.org/
difficulty-level: Easy

Description

This issue occurs from time to time; this time it was seen in the production Ultrascan gateway, https://django.ultrascan.scigap.org/. This gateway is connected to the production stack and the Django portal for admin operations.

When a node failure occurs after a job has been submitted and queued, the failure is reported through an email notification and the job goes to the UNKNOWN state in the gateway. On the remote cluster, the job is re-queued and completes, and further email notifications are sent. Helix, however, treats UNKNOWN as a final job state and does not process the emails that arrive afterwards.

Currently, when this happens, an operational task takes care of updating the job status and processing the email notifications sent.
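
The behaviour described above can be illustrated with a minimal sketch. This is not Airavata or Helix code: the `JobState` values other than UNKNOWN, the `TERMINAL_STATES` set, and the `apply_notification` and `replay` helpers are all hypothetical names chosen for illustration. The sketch assumes that one possible handling is to leave UNKNOWN out of the set of final states, so that notifications sent after a re-queue are still applied.

```python
from enum import Enum


class JobState(Enum):
    # Hypothetical state names; only UNKNOWN is taken from the issue text.
    SUBMITTED = "SUBMITTED"
    QUEUED = "QUEUED"
    ACTIVE = "ACTIVE"
    COMPLETE = "COMPLETE"
    FAILED = "FAILED"
    CANCELED = "CANCELED"
    UNKNOWN = "UNKNOWN"


# Assumption: states after which later notifications are ignored.
# UNKNOWN is deliberately excluded so a re-queued job can still progress.
TERMINAL_STATES = frozenset(
    {JobState.COMPLETE, JobState.FAILED, JobState.CANCELED}
)


def apply_notification(current: JobState, reported: JobState) -> JobState:
    """Return the job state after processing one email notification."""
    if current in TERMINAL_STATES:
        # The job already finished; a late notification changes nothing.
        return current
    return reported


def replay(notifications, start=JobState.SUBMITTED):
    """Apply a sequence of reported states in order and return the result."""
    state = start
    for reported in notifications:
        state = apply_notification(state, reported)
    return state


# Node failure while queued, then re-queue and completion on the cluster.
requeued_job = [
    JobState.QUEUED,
    JobState.UNKNOWN,
    JobState.QUEUED,
    JobState.ACTIVE,
    JobState.COMPLETE,
]
final_state = replay(requeued_job)
```

With UNKNOWN treated as non-final, the replayed sequence ends in COMPLETE instead of stopping at UNKNOWN. Whether this is the right change for the Helix workflow, and whether UNKNOWN should eventually time out into a final state, is left open here.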

People

Assignee: Dimuthu

Reporter: Eroma

Votes: 0

Watchers: 1

Dates

Created: 27/Feb/24 08:44

Updated: 27/Feb/24 08:44