Although I helped contribute to this jira, I have concerns about its safety and hope it's only a temporary fix.
I feel the config is of questionable value, since a misbehaving client may "forget" to cancel its tokens. The NN holds tokens in memory, so tokens that are never cancelled pile up until they expire. That could lead to a denial of service attack on the NN, perhaps an unintentional one.
When tokens are shared between jobs, it's ambiguous when they can be safely cancelled. How does a client know whether other running or queued jobs are still using the tokens? If the client intends to launch multiple jobs but errors out partway, it can't cancel the tokens, or "very bad" things will happen to the jobs already submitted: tasks will pound on the NN every second with the bad token, and yarn tasks appear to run "forever" if rpc connections fail. In a test environment, orphaned tasks pounded on the NN every second for a month.
Allowing the RM to cancel tokens when the job completes, which implies tokens are good for one and only one job submission, removes the ambiguity about when it's safe to cancel them. This reduces the chance of a DoS attack on the NN and, from a security perspective, shrinks the window of exposure compared with letting tokens linger until their lifetime expires.
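
To make the one-token-per-job lifecycle concrete, here is a minimal toy model in Java. It is only a sketch: TokenStore and PerJobTokenManager are hypothetical stand-ins for the NN's in-memory token table and the RM's bookkeeping, not actual Hadoop classes, and the string tokens are placeholders for real delegation tokens.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TokenLifecycleSketch {

    /** Hypothetical stand-in for the NN's in-memory token table. */
    public static class TokenStore {
        private final Set<String> live = new HashSet<>();
        private int nextId = 0;

        /** Issues a new token and keeps it in memory until cancelled. */
        public String issue() {
            String token = "token-" + (nextId++);
            live.add(token);
            return token;
        }

        public void cancel(String token) {
            live.remove(token);
        }

        public boolean isValid(String token) {
            return live.contains(token);
        }

        public int liveCount() {
            return live.size();
        }
    }

    /** Hypothetical stand-in for the RM: one token per job, cancelled on completion. */
    public static class PerJobTokenManager {
        private final TokenStore store;
        private final Map<String, String> tokenByJob = new HashMap<>();

        public PerJobTokenManager(TokenStore store) {
            this.store = store;
        }

        /** Each submission gets its own token, so no other job depends on it. */
        public String submit(String jobId) {
            String token = store.issue();
            tokenByJob.put(jobId, token);
            return token;
        }

        /** Job completion is an unambiguous point at which to cancel the token. */
        public void complete(String jobId) {
            String token = tokenByJob.remove(jobId);
            if (token != null) {
                store.cancel(token);
            }
        }
    }

    public static void main(String[] args) {
        TokenStore nn = new TokenStore();
        PerJobTokenManager rm = new PerJobTokenManager(nn);

        String t1 = rm.submit("job-1");
        String t2 = rm.submit("job-2");
        rm.complete("job-1");

        System.out.println("job-1 token valid: " + nn.isValid(t1));
        System.out.println("job-2 token valid: " + nn.isValid(t2));
        System.out.println("live tokens: " + nn.liveCount());
    }
}
```

Because each token belongs to exactly one job in this model, completing job-1 cancels only its own token and leaves job-2 untouched. With a shared token the same cancel call would invalidate both jobs, which is the ambiguity described above.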