The root cause was a firewall that prevented the JT from contacting the remote NN to renew a token. The tasks on the DNs, however, were able to contact the remote NN, so the job succeeded. The job would have failed had it run past the token expiration, since the JT was unable to renew the token.
If the JT has to acquire tokens for a job and acquisition fails, the job will fail. This is the ideal behavior, but there's a loophole: if the JT finds the token in the job's token cache, it "assumes" the token must be valid. In reality the token may be invalid, canceled, or long expired, or the NN may not even be reachable. In all of these cases the tasks get fired off anyway, only to clog up the cluster while they die a long, slow death. On 23, tasks using an invalid token have been observed pounding on the NN every second; on one cluster this went on for a month!
The JT immediately issues a token renewal and then uses a timer for future renewals. However, all renewals are done in a separate thread, which means that if the initial renewal fails because the token is bad, the job starts anyway. The simple solution is for the first renewal to occur in the job's context, so that an exception will kill the job, and for future renewals to remain thread-based.
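
A rough sketch of the proposed shape, in plain Java with no Hadoop dependencies. The names TokenRenewer, RenewalSketch, and registerJobToken are hypothetical and are not taken from the JT code, and renewing at half the remaining lifetime is an arbitrary choice for illustration. The point is only that the first renew() call runs on the caller's thread, so its exception reaches job submission, while later renewals stay on a timer thread.

```java
import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for whatever contacts the remote NN to renew a token.
interface TokenRenewer {
    // Renews the token and returns its new expiration time (ms since epoch).
    long renew(String token) throws IOException;
}

class RenewalSketch {
    // Single daemon thread used only for the follow-up renewals.
    private final ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "token-renewal");
            t.setDaemon(true);
            return t;
        });
    private final TokenRenewer renewer;

    RenewalSketch(TokenRenewer renewer) {
        this.renewer = renewer;
    }

    // Called in the job's context. The first renewal is synchronous, so a bad
    // token or an unreachable NN throws here and the job never starts.
    void registerJobToken(String token) throws IOException {
        long expiry = renewer.renew(token);
        scheduleNext(token, expiry);
    }

    // Later renewals run on the timer thread. Assumption: renew at half the
    // remaining lifetime; a failure here is logged and rescheduling stops.
    private void scheduleNext(String token, long expiry) {
        long delay = Math.max(0L, (expiry - System.currentTimeMillis()) / 2);
        timer.schedule(() -> {
            try {
                scheduleNext(token, renewer.renew(token));
            } catch (IOException e) {
                System.err.println("token renewal failed: " + e.getMessage());
            }
        }, delay, TimeUnit.MILLISECONDS);
    }

    void shutdown() {
        timer.shutdownNow();
    }
}
```

With this arrangement a token pulled from the job's token cache is validated by the same synchronous renew() before any task is launched, which would also close the loophole described above.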