Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Version: 0.18
Description
Currently, experiments fail in the following situation:
- The HPC queue reaches the maximum number of jobs allowed for the queue.
- The job submission fails and the HPC resource returns a job submission response [1]; Airavata then tags the experiment as FAILED.
- The only option for the gateway user is to submit the experiment again.
The required fix is for Airavata to have internal queues, or some other way to hold such experiments until the HPC queue can accept jobs, instead of marking the experiment as FAILED.
When enabling internal Airavata queues, we need to decide how the queues are kept: per gateway, per HPC resource, per gateway login user, etc. These implementation details need to be discussed and finalized, and input will also be required from HPC system administrators.
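To make the discussion concrete, here is a minimal sketch of one possible holding structure, assuming a queue keyed by (gateway, HPC resource, login user). It is not Airavata code: the class `HoldingQueues`, its methods, and the idea that the caller knows the number of free slots are all hypothetical and only illustrate the keying options listed above.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple

# Hypothetical key: (gateway_id, hpc_resource, login_user).
QueueKey = Tuple[str, str, str]


class HoldingQueues:
    """Holds experiment IDs in FIFO order until the HPC queue has room."""

    def __init__(self) -> None:
        self._queues: Dict[QueueKey, Deque[str]] = defaultdict(deque)

    def hold(self, key: QueueKey, experiment_id: str) -> None:
        # Park the experiment instead of marking it FAILED.
        self._queues[key].append(experiment_id)

    def release(self, key: QueueKey, free_slots: int) -> List[str]:
        # Hand back at most `free_slots` experiments, oldest first.
        queue = self._queues[key]
        released: List[str] = []
        while queue and len(released) < free_slots:
            released.append(queue.popleft())
        return released

    def pending(self, key: QueueKey) -> int:
        # Number of experiments still waiting under this key.
        return len(self._queues[key])
```

A caller would `hold` an experiment when submission is rejected for capacity reasons and call `release` once it learns that slots are free. How free slots are detected (polling, scheduler query, retry with backoff) is left open here, as it is one of the details to be finalized.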
[1]
This example is from Stampede2:
-----------------------------------------------------------------
Welcome to the Stampede2 Supercomputer
-----------------------------------------------------------------
No reservation for this job
--> Verifying valid submit host (login3)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...FAILED
[*] Too many simultaneous jobs in queue.
--> Max job limits for us3 = 50 jobs
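As a rough illustration of how such a response could be distinguished from other submission failures, the sketch below matches the two message fragments seen in the Stampede2 output above. The function names are hypothetical, and the patterns are taken only from this one example; other HPC resources will likely word the rejection differently.

```python
import re
from typing import Optional, Tuple

# Text fragments copied from the Stampede2 example; assumed, not a general format.
LIMIT_MARKER = "Too many simultaneous jobs in queue"
LIMIT_PATTERN = re.compile(r"Max job limits for (\S+) = (\d+) jobs")


def is_queue_limit_rejection(response: str) -> bool:
    """True if the submission was rejected because the job limit was reached."""
    return LIMIT_MARKER in response


def parse_job_limit(response: str) -> Optional[Tuple[str, int]]:
    """Return (login_user, max_jobs) if the response states the limit, else None."""
    match = LIMIT_PATTERN.search(response)
    if match is None:
        return None
    return match.group(1), int(match.group(2))
```

With a check like this, a capacity rejection could be routed to a holding state while any other failure still results in FAILED.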