A few users have run into a problem where submitting a workflow appears to hang (with a subworkflow the symptom is similar, but the job is stuck in PREP). If you wait long enough, the submission does go through and the workflow runs normally. The problem is that the forkjoin validation code is taking a very long time.
The attached example has a series of 20 forks where each fork has 6 actions (it's based on an actual workflow, but all of the names were changed and the actions were all replaced by simple shell actions). One of our support guys said it took 1-2 hours, but on my computer it was taking 15+ hours (I had to cancel it).
While this example doesn't have any nested forks, those can take a long time too.
It's easy to verify that it's the forkjoin validation code that's taking so long: take a jstack of the Oozie server and you'll see deep recursive calls to org.apache.oozie.workflow.lite.LiteWorkflowAppParser.validateForkJoin. I also noticed a lot of time being spent in calls to LinkedList.contains.
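To give a sense of scale, here is a minimal sketch that counts the distinct start-to-end paths through a chain shaped like the attached example (20 forks in series, 6 actions each). It is an illustration only: the graph builder and the node names are hypothetical, and it assumes (not confirmed from the validator's code) that the recursion ends up following each path separately. If that assumption holds, the work grows as 6^20.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ForkPathCount {
    // Hypothetical graph: start -> fork0 -> {actions} -> join0 -> fork1 -> ... -> end.
    static Map<String, List<String>> buildChain(int forks, int branches) {
        Map<String, List<String>> g = new HashMap<>();
        String prev = "start";
        for (int f = 0; f < forks; f++) {
            String fork = "fork" + f;
            String join = "join" + f;
            g.computeIfAbsent(prev, k -> new ArrayList<>()).add(fork);
            for (int a = 0; a < branches; a++) {
                String action = "action" + f + "_" + a;
                g.computeIfAbsent(fork, k -> new ArrayList<>()).add(action);
                g.computeIfAbsent(action, k -> new ArrayList<>()).add(join);
            }
            prev = join;
        }
        g.computeIfAbsent(prev, k -> new ArrayList<>()).add("end");
        return g;
    }

    // Counts start-to-end paths. Memoized, so computing the count itself is cheap
    // even though the number of paths is huge.
    static long countPaths(Map<String, List<String>> g, String node, Map<String, Long> memo) {
        if (node.equals("end")) {
            return 1L;
        }
        Long cached = memo.get(node);
        if (cached != null) {
            return cached;
        }
        long total = 0;
        for (String next : g.getOrDefault(node, Collections.<String>emptyList())) {
            total += countPaths(g, next, memo);
        }
        memo.put(node, total);
        return total;
    }

    static long countPaths(int forks, int branches) {
        return countPaths(buildChain(forks, branches), "start", new HashMap<String, Long>());
    }

    public static void main(String[] args) {
        // 20 forks in series with 6 actions each, as in the attached example.
        System.out.println(countPaths(20, 6)); // prints 3656158440062976
    }
}
```

That is roughly 3.7 x 10^15 paths for a graph with only 162 nodes, which would be consistent with run times measured in hours.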
I think we have 3 options:
- See if we can make the existing code faster somehow. Perhaps there's a way to parallelize it? Maybe there's some redundant checking that we can identify and skip? Change some data structures? etc
- See if we can write a new way to do this validation. I had originally completely rewritten this code a while ago, and we've since made a few fixes to catch edge cases and things. Perhaps it needs another rewrite?
- Try to identify when it's taking a long time and at least let the user know what's happening. Right now, it just appears that the Oozie CLI has hung and the job doesn't show up in the Oozie server. Most users aren't going to wait more than a minute or two.
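As a rough illustration of the first option (skipping redundant checking and changing data structures), the sketch below compares a naive recursive walk with one that records finished nodes in a HashSet. This is not Oozie's validation code and it does not check any fork/join rules; the class, the graph builder, and the node names are all hypothetical, and it assumes the validation result for a node does not depend on the path taken to reach it. It only shows how much repeated work a visited set removes, and HashSet membership is O(1) expected, versus the O(n) scan in LinkedList.contains.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class VisitedSetSketch {
    // Hypothetical graph: start -> fork0 -> {actions} -> join0 -> fork1 -> ... -> end.
    static Map<String, List<String>> buildChain(int forks, int branches) {
        Map<String, List<String>> g = new HashMap<>();
        String prev = "start";
        for (int f = 0; f < forks; f++) {
            String fork = "fork" + f;
            String join = "join" + f;
            g.computeIfAbsent(prev, k -> new ArrayList<>()).add(fork);
            for (int a = 0; a < branches; a++) {
                String action = "action" + f + "_" + a;
                g.computeIfAbsent(fork, k -> new ArrayList<>()).add(action);
                g.computeIfAbsent(action, k -> new ArrayList<>()).add(join);
            }
            prev = join;
        }
        g.computeIfAbsent(prev, k -> new ArrayList<>()).add("end");
        return g;
    }

    // Naive walk: everything downstream of a node is re-walked each time the node is reached.
    static long naiveVisits(Map<String, List<String>> g, String node) {
        long visits = 1;
        for (String next : g.getOrDefault(node, Collections.<String>emptyList())) {
            visits += naiveVisits(g, next);
        }
        return visits;
    }

    // Walk with a visited set: each node is expanded at most once.
    static long visitsWithSet(Map<String, List<String>> g, String node, Set<String> done) {
        if (!done.add(node)) {
            return 0; // already handled via another branch
        }
        long visits = 1;
        for (String next : g.getOrDefault(node, Collections.<String>emptyList())) {
            visits += visitsWithSet(g, next, done);
        }
        return visits;
    }

    public static void main(String[] args) {
        // Small case so the naive walk finishes quickly: 3 forks, 2 actions each.
        Map<String, List<String>> small = buildChain(3, 2);
        System.out.println(naiveVisits(small, "start"));                          // prints 44
        System.out.println(visitsWithSet(small, "start", new HashSet<String>())); // prints 14

        // Shape of the attached example; only the visited-set walk is practical here.
        Map<String, List<String>> big = buildChain(20, 6);
        System.out.println(visitsWithSet(big, "start", new HashSet<String>()));   // prints 162
    }
}
```

Whether the real validator can cache per-node results this way depends on how much path-specific state it carries (for example, the current fork/join stack), so this would need checking against the existing edge-case fixes.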