Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Description
When handleUpdateJobConfigArrival is invoked because a new job config has been posted, GobblinHelixJobScheduler first stops and deletes the old job, then tries to spin up the updated Helix workflow.
The job scheduler performs the stop synchronously with a default timeout of 10 seconds. However, the stop consistently takes longer than this timeout in Helix, so the job state is not correctly updated to stopped. As a result, when the GobblinHelixJobLauncher is constructed, the previous job is still in the wrong state because jobRunningMap has not been updated yet, and the new job is not launched. So we always see this log: Job {} will not be executed because other jobs are still running.
We can make the job deletion asynchronous and let the waitForJobCompletion method ensure that the job status is eventually updated correctly.
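
The difference between the two approaches can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: it does not call Helix or Gobblin, and the class AsyncDeleteSketch, the helper stopAndDeleteAsync, the two launchAfter... methods, and the map used here are hypothetical stand-ins. The waitForJobCompletion below only borrows the name of the real method and is a simple polling loop, not the actual implementation.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class AsyncDeleteSketch {
  // Hypothetical stand-in for jobRunningMap: job name -> is the job still running?
  static final Map<String, Boolean> jobRunningMap = new ConcurrentHashMap<>();

  // Hypothetical stand-in for a stop/delete that takes stopMillis to finish.
  // The running flag is cleared only once the stop has really completed.
  static CompletableFuture<Void> stopAndDeleteAsync(String jobName, long stopMillis) {
    return CompletableFuture.runAsync(() -> {
      try {
        Thread.sleep(stopMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      jobRunningMap.put(jobName, false);
    });
  }

  // Current behavior (simplified): block on the stop for at most timeoutMillis,
  // then decide immediately whether the new job may launch.
  static boolean launchAfterSyncStop(String jobName, long stopMillis, long timeoutMillis)
      throws Exception {
    jobRunningMap.put(jobName, true);
    CompletableFuture<Void> stop = stopAndDeleteAsync(jobName, stopMillis);
    try {
      stop.get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // The stop outlived the timeout, so the running flag is still stale.
    }
    return !jobRunningMap.get(jobName);
  }

  // Simplified polling loop: wait until the old job is no longer marked as
  // running, or until maxWaitMillis has elapsed.
  static boolean waitForJobCompletion(String jobName, long maxWaitMillis)
      throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMillis);
    while (System.nanoTime() < deadline) {
      if (!jobRunningMap.getOrDefault(jobName, false)) {
        return true;
      }
      Thread.sleep(10);
    }
    return !jobRunningMap.getOrDefault(jobName, false);
  }

  // Proposed behavior (simplified): fire the delete without blocking on it,
  // then rely on waitForJobCompletion to observe the updated state.
  static boolean launchAfterAsyncDelete(String jobName, long stopMillis, long maxWaitMillis)
      throws InterruptedException {
    jobRunningMap.put(jobName, true);
    stopAndDeleteAsync(jobName, stopMillis);
    return waitForJobCompletion(jobName, maxWaitMillis);
  }

  public static void main(String[] args) throws Exception {
    System.out.println("sync stop, launch allowed: "
        + launchAfterSyncStop("jobA", 300, 50));
    System.out.println("async delete, launch allowed: "
        + launchAfterAsyncDelete("jobB", 300, 2000));
  }
}
```

In this simulation the synchronous path gives up before the stop finishes and sees a stale running flag, which corresponds to the "will not be executed" log above, while the asynchronous path keeps polling until the flag is cleared. The durations are arbitrary illustrative values, not the real Helix timings.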