Description
One user submitted far too many workflows, each with a single Hive query: ~3600 workflows were running concurrently. Surprisingly, Oozie held up well without issues.
But Daryn from our Hadoop team saw that the number of delegation tokens fetched by Oozie was very high compared to the actual number of jobs submitted. The calls were stressing the RM and pushing it close to its memory limits. This is because we fetch a delegation token every time we create a JobClient instead of only during job submission.
So for one job we fetch:
1) 1 token during submission.
2) 1 token every 5 minutes when we check the status of the job.
3) 1 token after the job ends, to retrieve its status.
4) 1 token if we are killing the job.
So for a job running for 11 minutes, we would have fetched the token 4 times. It may be more in other cases like MapReduce, where we check for the end of both the launcher and the child job.
Only one of these tokens (the one used for job submission) is cancelled after the job completes. The other tokens are effectively leaked and are only cleaned up by the RM after the expiry period (24 hours by default). This can make the RM run out of memory.
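The counts above can be reproduced with a small back-of-the-envelope calculation. This is only an illustrative sketch, not Oozie code: the class and method names are hypothetical, it assumes one status check per completed 5-minute interval, and the 3600-job figure assumes every workflow behaves like the 11-minute example.

```java
public class TokenFetchEstimate {

    /**
     * Estimated delegation tokens fetched for one job, following the four
     * cases listed in the description. Assumes one status check per
     * completed poll interval (hypothetical simplification).
     */
    public static long tokensFetched(long runtimeMinutes, long pollIntervalMinutes, boolean killed) {
        long submission = 1;                                       // case 1
        long statusChecks = runtimeMinutes / pollIntervalMinutes;  // case 2
        long finalStatus = 1;                                      // case 3
        long kill = killed ? 1 : 0;                                // case 4
        return submission + statusChecks + finalStatus + kill;
    }

    /** Only the submission token is cancelled; the rest wait for expiry. */
    public static long tokensLeaked(long fetched) {
        return fetched - 1;
    }

    public static void main(String[] args) {
        long fetched = tokensFetched(11, 5, false);
        long leaked = tokensLeaked(fetched);
        System.out.println("fetched per job = " + fetched);          // 4
        System.out.println("leaked per job  = " + leaked);           // 3
        // Assumption: all ~3600 workflows look like the 11-minute job.
        System.out.println("leaked for 3600 jobs = " + 3600 * leaked); // 10800
    }
}
```

Under these assumptions, each job leaves 3 uncancelled tokens on the RM until they expire, so the leak grows with job runtime and with the number of concurrent workflows.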