[AIRAVATA-2941] Experiments fail to submit jobs to HPC cluster queues due to queue reaching the max job limit per user. - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 0.18
Fix Version/s: 0.18
Component/s: GFac, helix implementation
Labels:
- gsoc2020
Environment:
https://staging.ultrascan.scigap.org & https://ultrascan.scigap.org/

Description

Currently, experiments fail when the HPC queue reaches its maximum job limit per user.

When the job submission fails and the HPC system returns a rejection in the job submission response [1], Airavata tags the experiment as FAILED.
The only option for the gateway user is to submit the experiment again.

The required fix is for Airavata to have internal queues, or some other way to hold such experiments until the HPC queue can accept jobs again, instead of marking the experiment as FAILED.

When enabling internal Airavata queues, we need to consider how the queues are kept: per gateway, per HPC resource, per gateway login user, etc. These implementation details need to be discussed and finalized, and input will also be required from HPC system administrators.
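As a starting point for that discussion, the idea of holding experiments instead of failing them could be sketched roughly as follows. This is not Airavata code: the class `PendingJobQueues`, the key layout (gateway, HPC resource, login user), and the status strings are all hypothetical, and the per-key limit is assumed to be known in advance (for example, supplied by the HPC administrators).

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Tuple

# Hypothetical queue key: (gateway_id, hpc_resource, login_user).
QueueKey = Tuple[str, str, str]


@dataclass
class PendingJobQueues:
    """Illustrative holding queues; not part of Airavata."""

    # Assumed to be configured up front: max simultaneous jobs per key.
    max_jobs: Dict[QueueKey, int]
    # Jobs currently counted against the HPC queue, per key.
    active: Dict[QueueKey, int] = field(default_factory=lambda: defaultdict(int))
    # Experiments held back until a slot frees up, per key (FIFO).
    waiting: Dict[QueueKey, Deque[str]] = field(
        default_factory=lambda: defaultdict(deque)
    )

    def submit(self, key: QueueKey, experiment_id: str) -> str:
        """Submit if a slot is free; otherwise hold the experiment."""
        if self.active[key] < self.max_jobs[key]:
            self.active[key] += 1
            return "SUBMITTED"
        self.waiting[key].append(experiment_id)
        return "QUEUED"  # held internally, not marked FAILED

    def job_finished(self, key: QueueKey) -> List[str]:
        """Free one slot and release the oldest held experiment, if any."""
        self.active[key] = max(0, self.active[key] - 1)
        released: List[str] = []
        if self.waiting[key] and self.active[key] < self.max_jobs[key]:
            released.append(self.waiting[key].popleft())
            self.active[key] += 1
        return released
```

With a limit of 2 for one key, a third `submit` call returns `"QUEUED"`, and the next `job_finished` call for that key releases the held experiment. A real design would also need persistence, and a way to account for jobs the same login user submits outside Airavata; both are left out here.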

[1]

This example is from Stampede2:

-----------------------------------------------------------------
Welcome to the Stampede2 Supercomputer
-----------------------------------------------------------------
No reservation for this job
--> Verifying valid submit host (login3)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...FAILED
[*] Too many simultaneous jobs in queue.
--> Max job limits for us3 = 50 jobs
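To treat this case differently from other submission failures, the response would first have to be recognized. A minimal sketch, assuming the wording in the Stampede2 sample above is stable (other clusters and schedulers will word this differently); the function name and the returned dictionary are hypothetical:

```python
import re
from typing import Optional

# Patterns copied from the Stampede2 sample response; site-specific.
_LIMIT_FAILED = re.compile(r"Enforcing max jobs per user\.\.\.FAILED")
_LIMIT_VALUE = re.compile(r"Max job limits for (\S+) = (\d+) jobs")


def parse_job_limit_rejection(response: str) -> Optional[dict]:
    """Return the user and limit if the response is a max-jobs rejection.

    Returns None when the response does not contain the rejection marker.
    """
    if not _LIMIT_FAILED.search(response):
        return None
    match = _LIMIT_VALUE.search(response)
    if match is None:
        # Rejected for the job limit, but the limit line was not found.
        return {"user": None, "limit": None}
    return {"user": match.group(1), "limit": int(match.group(2))}
```

A caller could use a non-None result to hold or re-queue the experiment rather than tagging it FAILED.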

People

Assignee: Shameera

Reporter: Eroma

Votes: 0

Watchers: 1

Dates

Created: 12/Nov/18 21:14

Updated: 26/Mar/20 18:45