  Hadoop YARN / YARN-10348

Allow RM to always cancel tokens after app completes

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.10.0, 3.1.3
    • Fix Version/s: 3.2.2, 2.10.1, 3.4.0, 3.3.1
    • Component/s: yarn
    • Labels: None

    Description

      (Note: this change was originally made on our internal branch by daryn.)

      The RM currently lets a client disable token cancellation when a job completes. This feature was an initial attempt to address the case of a job launching sub-jobs (e.g. an Oozie launcher) where the original job finishes before the sub-jobs complete; that is, completion of the original job triggered premature cancellation of tokens still needed by the sub-jobs.

      Many years ago, daryn added a more robust implementation that reference-counts tokens (YARN-3055). It defers cancellation of a token until all apps using it have completed, which removed the need for a client to specify cancel=false. Unfortunately, the config option was not removed.
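
      The reference-counting idea can be illustrated with a minimal sketch. This is not the YARN-3055 code: the class TokenRefCounter, its methods, and the string ids are hypothetical names chosen for illustration, and the sketch assumes each token simply tracks the set of apps that still use it.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: a token is cancelled only when its last user finishes. */
class TokenRefCounter {
    // token id -> ids of the applications that still use the token
    private final Map<String, Set<String>> appsByToken = new HashMap<>();

    /** Records that appId uses tokenId. */
    void register(String tokenId, String appId) {
        appsByToken.computeIfAbsent(tokenId, k -> new HashSet<>()).add(appId);
    }

    /**
     * Drops appId's reference to tokenId.
     * Returns true only if no app references the token any more,
     * i.e. the caller may now cancel it.
     */
    boolean appFinished(String tokenId, String appId) {
        Set<String> apps = appsByToken.get(tokenId);
        if (apps == null) {
            return false; // unknown or already released token
        }
        apps.remove(appId);
        if (apps.isEmpty()) {
            appsByToken.remove(tokenId);
            return true;
        }
        return false;
    }
}
```

      With a launcher and a sub-job registered against the same token, the launcher finishing first does not release the token; only the sub-job finishing does.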

      We have seen cases where Oozie "java actions" and some users explicitly disabled token cancellation. This can lead to a buildup of defunct tokens that may overwhelm the ZK buffer used by the KDC's backing store, at which point the KMS fails to connect to ZK and is unable to issue or validate new tokens, leaving the KDC able to authenticate only pre-existing tokens. Production incidents have occurred due to the buffer size issue.

      To avoid these issues, the RM should have the option to ignore/override the client's request to not cancel tokens.
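
      The proposed override amounts to a small decision rule, sketched below. The class CancelPolicy and both parameter names are hypothetical and do not reflect the actual patch or any real configuration key; the sketch only assumes an RM-side boolean that takes precedence over the client's request.

```java
/** Hypothetical sketch of the RM-side override of a client's cancel=false request. */
class CancelPolicy {
    /**
     * @param clientRequestedCancel what the client asked for at submission time
     * @param rmAlwaysCancel        assumed RM-level setting that overrides the client
     * @return true if tokens should be cancelled once the app completes
     */
    static boolean shouldCancel(boolean clientRequestedCancel, boolean rmAlwaysCancel) {
        // The RM setting wins: when it is on, the client's opt-out is ignored.
        return rmAlwaysCancel || clientRequestedCancel;
    }
}
```

      Combined with reference counting, forcing cancellation this way would still not affect sub-jobs, because a token is released only after its last user completes.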

      Attachments

        1. YARN-10348-branch-3.2.001.patch
          9 kB
          Jim Brennan
        2. YARN-10348.002.patch
          9 kB
          Jim Brennan
        3. YARN-10348.001.patch
          8 kB
          Jim Brennan

          People

            Assignee: Jim Brennan (jbrennan)
            Reporter: Jim Brennan (jbrennan)