[AIRAVATA-2943] Re-queueing and node failures in HPC clusters need to be handled in gateway middleware as resubmitting failures - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.18
Fix Version/s: 0.18
Component/s: helix implementation
Labels: None
Environment: https://staging.ultrascan.scigap.org, SLURM job ID 8560 on Jetstream

Description

Currently, on PBS and SLURM clusters, jobs are sometimes re-queued by the scheduler after a node failure. In such cases the job does run after being re-queued, but the gateway records it as FAILED as soon as the initial NODE_FAIL is reported.

These failures need to be captured as retryable failures (the job is being resubmitted) rather than treated as a final result.
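The intended behavior could be sketched roughly as follows. This is only an illustration, not the fix that went into the helix implementation: the scheduler state names (NODE_FAIL, REQUEUED, PENDING, RUNNING, COMPLETED, FAILED, CANCELLED, TIMEOUT) are SLURM job states, while the gateway-side status names (including RETRYING) and the function `gateway_status` are hypothetical. PBS reports different state codes and would need its own mapping.

```python
# Hypothetical mapping from SLURM job states to gateway-side statuses.
# The gateway status names are assumptions, not Airavata's actual enum.
TERMINAL = {
    "COMPLETED": "COMPLETED",
    "FAILED": "FAILED",
    "CANCELLED": "CANCELED",
    "TIMEOUT": "FAILED",
}
ACTIVE = {
    "PENDING": "QUEUED",
    "REQUEUED": "QUEUED",
    "RUNNING": "ACTIVE",
}
# States that look like failures but may be followed by a re-queue.
RETRYABLE = {"NODE_FAIL"}


def gateway_status(observed_states):
    """Derive the gateway status from scheduler states in the order seen."""
    status = "UNKNOWN"
    for state in observed_states:
        if state in RETRYABLE:
            # Provisional: keep monitoring instead of failing the job.
            status = "RETRYING"
        elif state in ACTIVE:
            status = ACTIVE[state]
        elif state in TERMINAL:
            status = TERMINAL[state]
            break  # a real terminal state ends the job
        # Unrecognized states leave the previous status unchanged.
    return status


if __name__ == "__main__":
    # Job hits a node failure, is re-queued, and later finishes.
    print(gateway_status(
        ["PENDING", "RUNNING", "NODE_FAIL", "PENDING", "RUNNING", "COMPLETED"]
    ))
```

With this approach a NODE_FAIL only marks the job as retrying, and a later state from the scheduler decides the final outcome. A real implementation would presumably also need a timeout, so that a NODE_FAIL that is never followed by a re-queue is eventually reported as FAILED.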

People

Assignee: Dimuthu

Reporter: Eroma

Votes: 0

Watchers: 2

Dates

Created: 13/Nov/18 20:04

Updated: 01/Mar/19 22:16

Resolved: 01/Mar/19 22:16