Details
- Bug
- Status: Closed
- Blocker
- Resolution: Fixed
- 2.6.0
- Reviewed
Description
The RM used to globally track the unique set of tokens for all apps. It remembered the first job that was submitted with the token. The first job controlled the cancellation of the token. This prevented completion of sub-jobs from canceling tokens used by the main job.
As of YARN-2704, the RM tracks tokens on a per-app basis, with no notion of a first/main job. As a result, a completing sub-job cancels tokens that the main job and other sub-jobs still use, causing them to fail. The RM also appears to schedule multiple redundant renewals.
The issue is not immediately obvious because the RM cancels tokens ~10 min (the NM liveness interval) after log aggregation completes. As a result, an Oozie job (e.g. Pig) that launches many sub-jobs over time will fail if any sub-job is launched >10 min after any other sub-job completes. If all other sub-jobs complete within that 10 min window, the issue goes unnoticed.
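The difference between the two tracking schemes can be illustrated with a toy model. This is only a sketch under stated assumptions, not the RM's actual code: the class and method names (`TokenTrackingSketch`, `GlobalTracker`, `PerAppTracker`, `appSubmitted`, `appFinished`, `isValid`) are hypothetical, tokens and apps are plain strings, and cancellation happens immediately instead of after the ~10 min delay.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the two token-tracking schemes; hypothetical names, not Hadoop code. */
public class TokenTrackingSketch {

    /** Old scheme: one global entry per token, owned by the first app that submitted it. */
    public static class GlobalTracker {
        private final Map<String, String> ownerByToken = new HashMap<>();
        private final Set<String> cancelled = new HashSet<>();

        public void appSubmitted(String appId, Set<String> tokens) {
            for (String token : tokens) {
                // The first submitter stays the owner; later apps do not replace it.
                ownerByToken.putIfAbsent(token, appId);
            }
        }

        public void appFinished(String appId) {
            // Only tokens owned by the finishing app are cancelled.
            ownerByToken.entrySet().removeIf(entry -> {
                if (entry.getValue().equals(appId)) {
                    cancelled.add(entry.getKey());
                    return true;
                }
                return false;
            });
        }

        public boolean isValid(String token) {
            return !cancelled.contains(token);
        }
    }

    /** Per-app scheme: every app tracks its own tokens, with no first/main job. */
    public static class PerAppTracker {
        private final Map<String, Set<String>> tokensByApp = new HashMap<>();
        private final Set<String> cancelled = new HashSet<>();

        public void appSubmitted(String appId, Set<String> tokens) {
            tokensByApp.put(appId, new HashSet<>(tokens));
        }

        public void appFinished(String appId) {
            // Every token of the finishing app is cancelled, even if shared with other apps.
            Set<String> tokens = tokensByApp.remove(appId);
            if (tokens != null) {
                cancelled.addAll(tokens);
            }
        }

        public boolean isValid(String token) {
            return !cancelled.contains(token);
        }
    }

    public static void main(String[] args) {
        Set<String> shared = Collections.singleton("hdfs-token");

        GlobalTracker global = new GlobalTracker();
        global.appSubmitted("launcher", shared);
        global.appSubmitted("sub-job-1", shared);
        global.appFinished("sub-job-1");
        System.out.println("global tracking, token valid: " + global.isValid("hdfs-token"));

        PerAppTracker perApp = new PerAppTracker();
        perApp.appSubmitted("launcher", shared);
        perApp.appSubmitted("sub-job-1", shared);
        perApp.appFinished("sub-job-1");
        System.out.println("per-app tracking, token valid: " + perApp.isValid("hdfs-token"));
    }
}
```

In this model, the shared token survives the sub-job's completion under global tracking but is cancelled under per-app tracking while the launcher is still running, which is the failure mode described above.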
Attachments
Issue Links
- is broken by
  - YARN-2704 Localization and log-aggregation will fail if hdfs delegation token expired after token-max-life-time (Closed)
- is related to
  - HIVE-10992 WebHCat should not create delegation tokens when Kerberos is not enabled (Resolved)
  - YARN-3190 NM can't aggregate logs: token can't be found in cache (Resolved)
  - YARN-3055 The token is not renewed properly if it's shared by jobs (oozie) in DelegationTokenRenewer (Closed)
  - YARN-3439 RM fails to renew token when Oozie launcher leaves before sub-job finishes (Closed)