I looked into this a bit more recently.
Oozie kills a job by telling Hadoop to kill the launcher job. The launcher job doesn't write the child IDs until the children have finished, right before the launcher itself finishes (e.g. Pig gets run, then the launcher writes the IDs of all of the jobs launched by Pig to a file). However, when the launcher job gets killed by Hadoop, it never writes the file, which means that Oozie doesn't have the child IDs and so can't kill the children.
So in order to get this to work, I think we'd have to do some non-trivial refactoring of how the launcher jobs work. Some ideas I had were:
- Make the launcher job multithreaded, so that a second thread can pick up the child job IDs as soon as they become available, write them to the file, and keep updating the file as more jobs are launched. This way, when the launcher is killed, Oozie will have the child IDs (or at least most of them). This may not be possible for all action types.
- Have the launcher job listen on a port, accept a REST call, or something similar. Instead of asking Hadoop to kill the launcher job, Oozie would send it a command over that port/REST/etc. so that the launcher could take care of more "nicely" killing the job, including any children and then itself. This would require a lot of changes and make things really complicated, and it would probably also open up some security concerns.
I don't really see a clean solution, or one that we can easily apply to all action types.
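To make the first idea a bit more concrete, here is a rough sketch of what the second thread could look like. Everything in it is hypothetical: `ChildIdRecorder`, `recordChild`, and the flush interval are names I made up for illustration, not existing Oozie classes, and it writes to a local file, whereas the real launcher would have to write wherever Oozie reads the child IDs from. How the launcher learns a child ID while the action is still running is action-specific and is not shown.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical helper (not part of Oozie): keeps a file of child job IDs
 * up to date while the action is still running, instead of writing it
 * once at the very end.
 */
public class ChildIdRecorder implements AutoCloseable {
    private final Path idFile;
    private final Set<String> childIds = ConcurrentHashMap.newKeySet();
    private final ScheduledExecutorService flusher;

    public ChildIdRecorder(Path idFile, long flushIntervalMillis) {
        this.idFile = idFile;
        // Daemon thread, so it never keeps the launcher JVM alive on its own.
        this.flusher = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "child-id-flusher");
            t.setDaemon(true);
            return t;
        });
        this.flusher.scheduleAtFixedRate(this::flushQuietly,
                flushIntervalMillis, flushIntervalMillis, TimeUnit.MILLISECONDS);
    }

    /** Called by the action code whenever it learns about a new child job. */
    public void recordChild(String jobId) {
        childIds.add(jobId);
    }

    /** Writes all IDs known so far, one per line, replacing the old file. */
    public synchronized void flush() throws IOException {
        List<String> sorted = new ArrayList<>(childIds);
        Collections.sort(sorted);
        // Write to a temp file first so a kill mid-write doesn't leave
        // a half-written ID file behind.
        Path tmp = idFile.resolveSibling(idFile.getFileName() + ".tmp");
        Files.write(tmp, sorted, StandardCharsets.UTF_8);
        Files.move(tmp, idFile, StandardCopyOption.REPLACE_EXISTING);
    }

    private void flushQuietly() {
        try {
            flush();
        } catch (IOException e) {
            // Best effort: a failed flush is retried on the next tick.
            System.err.println("Could not write child IDs: " + e.getMessage());
        }
    }

    /** Stops the background thread and does one final flush. */
    @Override
    public void close() throws IOException {
        flusher.shutdownNow();
        flush();
    }
}
```

Even with something like this, any child launched after the last flush and before the kill would still be missed, which is why I said "or at least most of them" above.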