Details
Bug
Status: Open
Major
Resolution: Unresolved
0.17
None
Description
Zhong, with the dREG gateway, reported an experiment whose status was "stuck" in EXECUTING even though the job had status COMPLETED. It looks like the api-orch service on gw56 was shut down, probably at the same time that the orchestrator was handling the COMPLETED process status message. The process status subscriber automatically acks messages, so the message had already been removed from the queue and was no longer available when the orchestrator was restarted.
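To make the suspected failure mode concrete, here is a minimal in-memory sketch. It is a toy model, not the actual messaging client: `ToyQueue`, `AckModeDemo`, and `messagesLeftAfterCrash` are hypothetical names, and the model assumes the broker requeues a delivered-but-unacked message when the consumer disconnects. It compares what a restarted consumer can still see after a shutdown that lands between delivery and the end of handling.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy in-memory model of a broker queue; NOT the real messaging client API.
final class AckModeDemo {

    /** Holds a delivered message as "unacked" until the consumer confirms it. */
    static final class ToyQueue {
        private final Deque<String> ready = new ArrayDeque<>();
        private String unacked; // delivered but not yet acknowledged

        void publish(String message) {
            ready.addLast(message);
        }

        /** Delivers the next message. With autoAck the queue forgets it immediately. */
        String deliver(boolean autoAck) {
            String message = ready.pollFirst();
            if (message != null && !autoAck) {
                unacked = message;
            }
            return message;
        }

        /** Would be called by the consumer once handling has succeeded. */
        void ack() {
            unacked = null;
        }

        /** Consumer connection dropped: assume any unacked message is requeued. */
        void consumerDisconnected() {
            if (unacked != null) {
                ready.addFirst(unacked);
                unacked = null;
            }
        }

        int size() {
            return ready.size();
        }
    }

    /**
     * Simulates a consumer that is shut down after receiving the message but
     * before it finishes handling it. Returns how many messages are still
     * available to a restarted consumer.
     */
    static int messagesLeftAfterCrash(boolean autoAck) {
        ToyQueue queue = new ToyQueue();
        queue.publish("process status: COMPLETED");
        queue.deliver(autoAck);        // orchestrator receives the message
        queue.consumerDisconnected();  // service shut down mid-handling, no ack sent
        return queue.size();
    }

    public static void main(String[] args) {
        System.out.println("auto-ack, left after crash:   " + messagesLeftAfterCrash(true));
        System.out.println("manual ack, left after crash: " + messagesLeftAfterCrash(false));
    }
}
```

In this model the auto-ack path leaves nothing for the restarted consumer, which matches the symptom reported here, while the manual-ack path leaves the message in the queue for redelivery.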
In gfac's log, the process completes at 2017-02-17 13:41:01:
2017-02-17 13:41:01 [pool-9-thread-11] INFO o.a.a.g.core.context.ProcessContext - expId: Clone_of_2M_data_82c732b8-5bd5-4e24-b1cc-ce3fd480d677, processId: PROCESS_3b22553a-b9ed-4250-a1dd-8b555ecede80 :- Process status changed OUTPUT_DATA_S
api-orch was shut down and restarted several times around the same time:
2017-02-17 13:37:03 [main] INFO o.a.a.api.server.AiravataAPIServer - API server started over TLS on Port: 9930
...
2017-02-17 13:40:23 [main] INFO o.a.a.api.server.AiravataAPIServer - API server started over TLS on Port: 9930
...
2017-02-17 13:43:02 [main] INFO o.a.a.api.server.AiravataAPIServer - API server started over TLS on Port: 9930
...
2017-02-17 13:48:23 [main] INFO o.a.a.api.server.AiravataAPIServer - API server started over TLS on Port: 9930
...
2017-02-17 14:10:58 [main] INFO o.a.a.api.server.AiravataAPIServer - API server started over TLS on Port: 9930
...
A couple of solution ideas:
- make the status queue subscriber acknowledge messages explicitly, only after the orchestrator has finished handling them, instead of auto-acking on delivery
- have the orchestrator check the process status in the registry for every incomplete experiment when it starts up
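The second idea could look roughly like the sketch below. It is illustrative only: `StartupReconciler` and `reconcile` are hypothetical names, the two maps stand in for registry lookups, and the state strings are limited to the two values mentioned in this report. The real orchestrator and registry APIs are not shown here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical startup reconciliation pass; names are illustrative, not Airavata's API.
final class StartupReconciler {

    /**
     * Brings stale experiment states up to date from the recorded process states.
     *
     * @param experimentStates experimentId -> experiment state as currently recorded
     * @param processStates    experimentId -> latest process state found in the registry
     * @return ids of the experiments whose state was corrected
     */
    static List<String> reconcile(Map<String, String> experimentStates,
                                  Map<String, String> processStates) {
        List<String> updated = new ArrayList<>();
        for (Map.Entry<String, String> entry : experimentStates.entrySet()) {
            boolean incomplete = "EXECUTING".equals(entry.getValue());
            boolean processDone = "COMPLETED".equals(processStates.get(entry.getKey()));
            if (incomplete && processDone) {
                // The status message was missed; catch up from the registry.
                entry.setValue("COMPLETED");
                updated.add(entry.getKey());
            }
        }
        return updated;
    }

    public static void main(String[] args) {
        Map<String, String> experiments = new LinkedHashMap<>();
        experiments.put("exp-stuck", "EXECUTING");
        experiments.put("exp-running", "EXECUTING");

        Map<String, String> processes = new LinkedHashMap<>();
        processes.put("exp-stuck", "COMPLETED");
        processes.put("exp-running", "EXECUTING");

        System.out.println("corrected on startup: " + reconcile(experiments, processes));
    }
}
```

The two ideas are complementary: explicit acks prevent the message from being lost in the first place, while a startup check also recovers experiments that are already stuck.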