After discussion with Devaraj and Owen, here is a summary of the approach:
- Child.java can have cleanup code in a finally block. This ensures that cleanup happens when the failure is caused by an Exception/Error, which covers the majority of cases.
- Any other kind of failure or kill of the attempt puts it in the FAILED_UNCLEAN or KILLED_UNCLEAN state. JobTracker will launch a separate cleanup task for FAILED_UNCLEAN and KILLED_UNCLEAN attempts. The cleanup task will take the attempt to FAILED or KILLED.
- JT stops launching cleanup tasks for attempts once the job succeeds/fails. As Devaraj pointed out, by then the job-level cleanup task (OutputCommitter.cleanupJob) has also run, and we assume that the job-level cleanup has removed all the garbage.
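The three points above can be sketched roughly as follows. This is only an illustration under assumptions: AttemptState and CleanupSketch are hypothetical names, not the actual Hadoop classes, and the state names are taken from the summary above.

```java
/** Hypothetical model of the attempt states named above; not Hadoop's real TaskStatus. */
enum AttemptState {
    SUCCEEDED, FAILED_UNCLEAN, KILLED_UNCLEAN, FAILED, KILLED;

    /** Only *_UNCLEAN attempts need a separate cleanup task. */
    boolean needsCleanup() {
        return this == FAILED_UNCLEAN || this == KILLED_UNCLEAN;
    }

    /** State the attempt moves to once its cleanup task has succeeded. */
    AttemptState afterCleanup() {
        switch (this) {
            case FAILED_UNCLEAN: return FAILED;
            case KILLED_UNCLEAN: return KILLED;
            default: return this;
        }
    }
}

/** Hypothetical helper; the method names are assumptions made for this sketch. */
final class CleanupSketch {
    /** In-child path: an Exception/Error triggers cleanup from the finally block. */
    static AttemptState runInChild(Runnable task, Runnable cleanup) {
        boolean ok = false;
        try {
            task.run();
            ok = true;
        } catch (Throwable t) {
            // Exception or Error: the attempt failed, cleanup runs in finally.
        } finally {
            if (!ok) {
                cleanup.run();
            }
        }
        return ok ? AttemptState.SUCCEEDED : AttemptState.FAILED;
    }

    /** JT-side rule: no cleanup tasks are launched once the job has succeeded/failed. */
    static boolean shouldLaunchCleanup(AttemptState state, boolean jobComplete) {
        return !jobComplete && state.needsCleanup();
    }
}
```

For example, CleanupSketch.shouldLaunchCleanup(AttemptState.KILLED_UNCLEAN, false) is true, while the same call with jobComplete set to true is false, because the job-level cleanup is assumed to have removed the garbage already.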
There are two approaches for launching the cleanup:
1. We can use the same attempt for launching the cleanup. Here, the same attempt will be launched with a starting state of *_UNCLEAN instead of UNASSIGNED. When the cleanup is successful, it will go to FAILED or KILLED. If it fails, it will be left in the *_UNCLEAN state.
We would need additional logic in the scheduler to handle retries, if needed.
2. Have a separate tip for doing the cleanup. Associate the cleanup tip with the failed/killed attempt by passing the attempt_id through the configuration.
Once the tip succeeds (it gets up to four attempts, by default), it will move the corresponding attempt to FAILED or KILLED. If the tip fails, it will leave the attempt in the *_UNCLEAN state.
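A minimal sketch of approach 2, under stated assumptions: the configuration is modelled as a plain Map rather than the real job configuration, the key "sketch.cleanup.attempt.id" and the class CleanupTipSketch are made up for illustration, and the limit of four attempts is the default mentioned above.

```java
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch of a separate cleanup tip; not the actual JobTracker code. */
final class CleanupTipSketch {
    /** Made-up configuration key that carries the id of the attempt to clean up. */
    static final String ATTEMPT_ID_KEY = "sketch.cleanup.attempt.id";

    /** Default number of attempts the cleanup tip gets, as stated in the text. */
    static final int MAX_ATTEMPTS = 4;

    /**
     * Runs the cleanup for the attempt named in the configuration.
     * Returns the resulting state of that attempt: FAILED or KILLED if some
     * cleanup attempt succeeded, otherwise the unchanged *_UNCLEAN state.
     */
    static String runCleanupTip(Map<String, String> conf,
                                String uncleanState,
                                Predicate<String> cleanup) {
        String attemptId = conf.get(ATTEMPT_ID_KEY);
        for (int i = 0; i < MAX_ATTEMPTS; i++) {
            if (cleanup.test(attemptId)) {
                // FAILED_UNCLEAN -> FAILED, KILLED_UNCLEAN -> KILLED
                return uncleanState.replace("_UNCLEAN", "");
            }
        }
        // The tip failed: the attempt stays in its *_UNCLEAN state.
        return uncleanState;
    }
}
```

Compared with approach 1, the retry loop here stands in for the retries a separate tip would get from its own attempts, which is why no extra scheduler logic appears in the sketch.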