Description
Suppose you have a workflow like this:
start --> fork
fork --> shell1, shell2
shell1 --> join
shell2 --> join
join --> shell3
shell3 --> end
Every action except shell3 succeeds.
After you fix the problem with shell3 and rerun the workflow, one of two outcomes occurs:
- If shell1 finished before shell2 in the original run, the rerun succeeds.
- If shell2 finished before shell1 in the original run, the rerun fails.
The error in the second outcome is simply this log message:
2014-05-29 17:17:03,735 ERROR org.apache.oozie.workflow.lite.LiteWorkflowInstance: SERVER[cdh5-1.cloudera.local] USER[pdvorak] GROUP[-] TOKEN[] APP[test-rerun-wf] JOB[0000004-140521220856264-oozie-oozi-W] ACTION[0000004-140521220856264-oozie-oozi-W@join] invalid execution path [/shell1/]
After a bunch of digging, I discovered that during a rerun of the above workflow (or similar workflows), LiteWorkflowInstance#signal gets called for each action under the fork node in the order in which the actions are listed in the fork node's XML. During the original run, however, LiteWorkflowInstance#signal gets called for each action in the order in which the actions complete (i.e. by endTime). When these two orders don't match, you get the above error. The general fix is therefore to ensure that during a rerun, LiteWorkflowInstance#signal gets called for each action under the fork node in the order in which the actions originally completed. And if you think about it, that is more correct than the current behavior anyway.
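The ordering idea behind the proposed fix can be sketched as follows. This is only an illustration, not the actual Oozie patch: the ForkedAction class, its endTimeMillis field, and the signalOrder helper are hypothetical names invented for this example, and the real change would have to work against Oozie's own action and workflow-instance classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class RerunOrderSketch {

    /** Hypothetical stand-in for a completed action under a fork node. */
    static final class ForkedAction {
        final String name;
        final long endTimeMillis; // when the action finished in the original run

        ForkedAction(String name, long endTimeMillis) {
            this.name = name;
            this.endTimeMillis = endTimeMillis;
        }
    }

    /**
     * Returns action names in the order they would be signalled on rerun:
     * ascending by original end time, not by position in the fork's XML.
     */
    static List<String> signalOrder(List<ForkedAction> actionsInXmlOrder) {
        // Copy first so the caller's XML-ordered list is left untouched.
        List<ForkedAction> sorted = new ArrayList<>(actionsInXmlOrder);
        // List.sort is stable, so actions with equal end times keep their XML order.
        sorted.sort(Comparator.comparingLong((ForkedAction a) -> a.endTimeMillis));
        List<String> names = new ArrayList<>();
        for (ForkedAction a : sorted) {
            names.add(a.name);
        }
        return names;
    }

    public static void main(String[] args) {
        // XML order is shell1, shell2, but shell2 finished first in the original run.
        List<ForkedAction> xmlOrder = Arrays.asList(
                new ForkedAction("shell1", 2000L),
                new ForkedAction("shell2", 1000L));
        System.out.println(signalOrder(xmlOrder)); // prints [shell2, shell1]
    }
}
```

In the failing scenario from this report (shell2 finished before shell1), the sketch yields shell2 first, which matches the order of the original run rather than the XML order.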
Attachments
Issue Links
- breaks
- OOZIE-1993 Rerun fails during join in certain condition (Resolved)
- OOZIE-1989 NPE during a rerun with forks (Closed)