I looked into this this morning, and I believe I've gotten to the bottom of it.
The trouble is not that the delegation token still identifies the first NN. In the above comment, note that you output the delegation token of the first job (_0001), which we would not expect to have changed now that you're running a new job (_0002).
The trouble in fact seems to be that a delegation token issued by the second-listed NN, if used to communicate with the first-listed NN, will cause an exception that the client failover code doesn't interpret as warranting a failover/retry to the other NN.
Mingjie: I believe a simpler way to reproduce this issue would just be to start up a fresh cluster, make the second-listed NN active, and then try to run a job. Can you check on that?
The crux of the issue is that the client failover code, when initially trying to connect to the first-listed NN, will use the delegation token issued by the second-listed NN, which the first-listed NN isn't aware of. This causes a SaslException to be thrown in the o.a.h.ipc.Server, before we even get to the NN RPC code which would check an operation category and throw a StandbyException, which the client failover code knows how to deal with.
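To make the failure mode concrete, here's a minimal sketch (stand-in classes, not the actual Hadoop types; the method name shouldFailover is illustrative) of why the client gives up: the retry logic only treats a StandbyException as a signal to try the other NN, and the SASL-layer failure surfaces as a different exception type before the NN's operation-category check ever runs.

```java
// Stand-ins for the relevant exception types (names chosen for illustration).
class StandbyException extends Exception {}
class SaslAuthFailure extends Exception {}

public class FailoverDecision {
    // Simplified version of the client-side failover-on-exception check.
    static boolean shouldFailover(Exception e) {
        // Only a StandbyException from the remote NN warrants retrying
        // against the other NN; anything else is treated as a hard failure.
        return e instanceof StandbyException;
    }

    public static void main(String[] args) {
        // The DT issued by the second-listed NN is unknown to the first-listed
        // NN, so the server's SASL handshake fails before RPC dispatch, and
        // the client never sees the StandbyException it knows how to handle.
        System.out.println(shouldFailover(new SaslAuthFailure()));  // prints false
        System.out.println(shouldFailover(new StandbyException())); // prints true
    }
}
```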
This wasn't caught in HDFS-2904 testing because that test case gets a DT from the first-listed NN, fails over to the second-listed NN, and then ensures that the client can still communicate with the second-listed NN using the DT. However, in that scenario, both NNs are in fact aware of the DT, since the first-listed NN issued it, and before becoming active the second-listed NN reads all of the edit logs and is therefore aware of the DT as well.
One solution I can think of is to make the SecretManager in the o.a.h.ipc.Server aware of the current HA state and throw a StandbyException if the NN isn't currently active. I'm not in love with this solution, as it leaks abstractions all over the place, so would welcome alternative suggestions.
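For illustration, here's a rough sketch of what I have in mind (all names here are hypothetical, not the real Hadoop APIs): the server-side secret manager gets a view of HA state and refuses token validation with a StandbyException when the NN isn't active, so the failure becomes something the client failover code can recognize.

```java
class StandbyException extends Exception {}

// Hypothetical hook the secret manager could use to query HA state.
interface HAStateSource {
    boolean isActive();
}

class HAAwareSecretManager {
    private final HAStateSource haState;

    HAAwareSecretManager(HAStateSource haState) {
        this.haState = haState;
    }

    // Called from the RPC server's SASL layer before validating a token.
    byte[] retrievePassword(String tokenId) throws StandbyException {
        if (!haState.isActive()) {
            // Surface an exception the client failover code understands,
            // instead of letting the SASL handshake fail opaquely.
            throw new StandbyException();
        }
        return lookupPassword(tokenId);
    }

    private byte[] lookupPassword(String tokenId) {
        return new byte[0]; // placeholder for the real token lookup
    }
}
```

This is exactly the abstraction leak I mentioned: the IPC-layer secret manager now has to know about NN HA state.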