Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Description
Noticed multiple times in our production environment.
If one of the actions in a fork fails with a transient error (but succeeds after a few retries), the fork's paths never join.
This happens when one of the actions in the fork fails to submit a job.
On a transient error, Oozie requeues the command with queue(this, retryDelayMillis). ActionStartXCommand does not reload the job if its job reference is already non-null.
Before ActionStartXCommand runs again, other actions have already started and modified the job info. The requeued ActionStartXCommand still holds the old info and writes it to the DB, so some workflow instance data is lost.
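The lost update described above can be reproduced in a small standalone sketch. This is not Oozie code: Job, StartCommand, the in-memory db map, and the reloadOnRetry flag are hypothetical stand-ins used only to illustrate how a command that keeps its cached job across a retry overwrites changes made by a sibling action. Reloading on retry is shown as one possible remedy, not necessarily the fix that was applied.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class StaleJobSketch {
    // Hypothetical stand-in for the workflow job row stored in the DB.
    static final class Job {
        final Set<String> startedActions = new TreeSet<>();

        Job copy() {
            Job j = new Job();
            j.startedActions.addAll(startedActions);
            return j;
        }
    }

    // Hypothetical in-memory "DB" keyed by job id.
    static final Map<String, Job> db = new HashMap<>();

    // Hypothetical stand-in for a start command that is requeued on a transient error.
    static final class StartCommand {
        final String jobId;
        final String action;
        final boolean reloadOnRetry;
        Job job; // kept across retries because the same command object is requeued

        StartCommand(String jobId, String action, boolean reloadOnRetry) {
            this.jobId = jobId;
            this.action = action;
            this.reloadOnRetry = reloadOnRetry;
        }

        // Mirrors the reported behavior: skip loading when the job is already set.
        void load() {
            if (job == null || reloadOnRetry) {
                job = db.get(jobId).copy();
            }
        }

        // Load (if needed), record this action as started, and write the job back.
        void execute() {
            load();
            job.startedActions.add(action);
            db.put(jobId, job.copy());
        }
    }

    // Runs the interleaving from the description and returns what ends up in the DB.
    static Set<String> run(boolean reloadOnRetry) {
        db.put("wf", new Job());
        StartCommand a = new StartCommand("wf", "A", reloadOnRetry);
        StartCommand b = new StartCommand("wf", "B", false);
        a.load();    // first attempt of A loads the job, then fails and is requeued
        b.execute(); // sibling action B starts and updates the job
        a.execute(); // retry of A runs with whatever job it holds
        return db.get("wf").startedActions;
    }

    public static void main(String[] args) {
        System.out.println("stale retry:     " + run(false));
        System.out.println("reload on retry: " + run(true));
    }
}
```

With run(false) the retry writes back its stale copy, so the record that B started is lost. With run(true) the retry re-reads the job first and both actions are preserved.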