My organization has been bitten by a mistake in the way we have written Java action applications. I would like to introduce a documentation change that might reduce the likelihood that others new to Oozie make the same mistake.
The mistake is not accounting for the possibility that launcher tasks will fail due to reasons such as cluster maintenance. We have a number of jobs that take input and output paths as arguments. Our code had been specifically written such that if the output path already exists the job fails to avoid inadvertently deleting an output that may have been consumed by a downstream job.
This has bitten us during cluster maintenance that requires TaskTracker restarts. During such an event any launcher running on a TaskTracker at the time of the TaskTracker restart fails and is retried on another TaskTracker. The new attempt of the launcher task fails due to the output directory already existing. This in turn fails the whole workflow. Maintenance that requires restarting all TaskTrackers can end up causing a lot of workflow failures.
The current documentation does hint at such issues via mention of the “prepare” block, but I don’t think the explanation of this block is clear enough for newcomers to understand its use. Furthermore, I’m not sure the prepare block is the best answer for how to handle the specific types of issues I am referring to. A “delete” action in a prepare block will delete content regardless of state, which provides the possibility that a previously completed good output could be deleted. This can lead to issues such as corrupted traceability when there is a need to trace an output back to the inputs that produced it.
I believe a more appropriate implementation to address the possibility of launcher task failure is to write the action such that it uses a previous complete output without deleting or reprocessing. Only if it detects an incomplete output does it delete the output and re-run the processing to produce the output. This protects from the possibility of accidental output destruction.
Furthermore, some types of actions spawn activity that runs asynchronously outside the context of the launcher task itself. In such cases the action author must take care to clean up any stray activity spawned prior to the failure of the initial launcher task to ensure it does not collide with activity produced by the new attempt of the launcher. In the case of my organization, such activity includes child M/R jobs spawned from the Apache Crunch pipelines we invoke from our Java actions. Depending on the design of the action, it can be required to find and kill such child jobs before invoking the new pipeline and spawning new child jobs.
I will attach a patch that demonstrates one possible documentation improvement to shed light on these issues but I appreciate feedback and any other ideas.