Hadoop Common / HADOOP-12649

Improve Kerberos diagnostics and failure handling

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.7.1
    • Fix Version/s: None
    • Component/s: security
    • Labels:
      None
    • Environment:

      Kerberos

      Description

      Sometimes, apparently, some people cannot get Kerberos to work.

      The ability to diagnose problems here is hampered by some aspects of UGI:

      1. the only way to turn on JAAS debug information is through an env var, not within the JVM
      2. failures are potentially underlogged
      3. exceptions raised are generic IOEs, so can't be trapped and filtered
      4. failure handling on the TGT renewer thread is nonexistent
      5. the code is a barely readable, underdocumented mess.

          Activity

          stevel@apache.org Steve Loughran added a comment -

          If a ticket can't be renewed because you logged in via kinit and the ticket has expired, the renewer thread exits with nothing but a warning. It doesn't even print the stack trace of the nested exception.

          2015-12-16 12:57:44,005 [TGT Renewer for stevel@COTHAM] WARN  security.UserGroupInformation (run(914)) - Exception encountered while running the renewal command. Aborting renew thread. ExitCodeException exitCode=1: kinit: krb5_get_kdc_cred: Error from KDC: TKT_EXPIRED
          

          A near-silent failure is not always what you want. Nothing prevents a renewal-failure action from being provided to this thread, allowing an application-level action to be performed (maybe even a retry).
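
A minimal sketch of what such a renewal-failure hook might look like. Every name here (`RenewalFailureHandler`, `RenewerSketch`, `RenewCommand`) is hypothetical and does not exist in UserGroupInformation; the renew command is a stand-in for the real kinit invocation.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical callback: nothing with this name exists in UGI today.
interface RenewalFailureHandler {
  /** Returns true if the renewer should try again, false to give up. */
  boolean onRenewalFailure(Exception cause, int attempt);
}

// Stand-in for the renewer loop; the real thread shells out to kinit.
class RenewerSketch implements Runnable {
  interface RenewCommand {
    void renew() throws Exception;
  }

  private final RenewCommand command;
  private final RenewalFailureHandler handler;
  private final AtomicInteger failures = new AtomicInteger();

  RenewerSketch(RenewCommand command, RenewalFailureHandler handler) {
    this.command = command;
    this.handler = handler;
  }

  @Override
  public void run() {
    int attempt = 0;
    while (true) {
      try {
        command.renew();
        return; // one successful renewal is enough for this sketch
      } catch (Exception e) {
        attempt++;
        failures.incrementAndGet();
        // Let the application decide: log, alert, retry, or abort.
        if (!handler.onRenewalFailure(e, attempt)) {
          return;
        }
      }
    }
  }

  int failureCount() {
    return failures.get();
  }
}
```

An application could pass a handler such as `(cause, attempt) -> attempt < 3` to retry a few times, or one that always returns false and raises an alert instead of exiting quietly.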

          stevel@apache.org Steve Loughran added a comment -

          Example: the only way to debug JAAS internals is to set the env var HADOOP_JAAS_DEBUG. It is therefore impossible to enable this from inside the JVM.

          Better: provide a method to turn this on, and/or hook it up to the log level of UGI itself. That is, if the UGI log is at debug, turn JAAS debug on automatically.
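
A rough sketch of the idea, assuming the environment variable is treated as a boolean and that the switch ends up as the `debug` option accepted by the JDK's Krb5LoginModule. The class `JaasDebugOptions` and its method are hypothetical, and `java.util.logging` stands in for whatever logging API UGI actually uses.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not part of Hadoop.
class JaasDebugOptions {

  /**
   * Builds login-module options, enabling "debug" when either the
   * environment value is "true" or the supplied log is at debug (FINE).
   */
  static Map<String, String> buildOptions(String envValue, Logger log) {
    Map<String, String> options = new HashMap<>();
    boolean fromEnv = "true".equalsIgnoreCase(envValue);
    boolean fromLog = log.isLoggable(Level.FINE);
    if (fromEnv || fromLog) {
      options.put("debug", "true");
    }
    return options;
  }
}
```

A caller would pass `System.getenv("HADOOP_JAAS_DEBUG")` as the first argument, so the existing variable keeps working while the log level provides a second switch that can be flipped from inside the JVM.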

          aw Allen Wittenauer added a comment -

          There is nothing better than waking up to find some magic, undocumented env var like HADOOP_JAAS_DEBUG.

          stevel@apache.org Steve Loughran added a comment -

          SLIDER-1027 covers a kerberos diagnostics command line entry I'm adding in slider; it has to go into the hadoop.security package to be able to force keytab renewal.

          It's a bit limited in what it can debug; there's not enough information for diagnostics, and when something like the renewer thread exits, there is no obvious way to detect it. At the very least, have some boolean we can probe to see if the thread is running.
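
One way such a probe could be exposed, as a sketch only: `ProbeableRenewer` is a hypothetical wrapper, and the wrapped `Runnable` stands in for the real renewal loop.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical wrapper that records whether the renewal body is still alive.
class ProbeableRenewer implements Runnable {
  private final AtomicBoolean running = new AtomicBoolean(false);
  private volatile Throwable lastFailure;
  private final Runnable body;

  ProbeableRenewer(Runnable body) {
    this.body = body;
  }

  @Override
  public void run() {
    running.set(true);
    try {
      body.run();
    } catch (RuntimeException e) {
      // Keep the cause so diagnostics can report why the thread stopped.
      lastFailure = e;
    } finally {
      running.set(false);
    }
  }

  /** True only while the renewal body is executing. */
  boolean isRunning() {
    return running.get();
  }

  /** The exception that ended the body, or null if none was recorded. */
  Throwable getLastFailure() {
    return lastFailure;
  }
}
```

A diagnostics command could then report both that the renewer has stopped and the exception that stopped it, rather than inferring the exit from a single WARN line.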

          stevel@apache.org Steve Loughran added a comment -

          Except the secret sysprops needed for JRE debugging, like sun.security.spnego.debug

          See also: https://steveloughran.gitbooks.io/kerberos_and_hadoop/content/sections/secrets.html

          aw Allen Wittenauer added a comment -

          Opened HADOOP-12650 to basically audit the entire code base for all of the getenvs in the Java code.

          stevel@apache.org Steve Loughran added a comment -

          loginUserFromKeytabAndReturnUGI could perhaps check that its parameters (especially principal) are non-null, and that the keytab exists and is non-empty. It could then fail with more useful messages than "login failure for user null".
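
The checks described above might look roughly like this. `KeytabLoginChecks.validate` is a hypothetical helper, not an existing Hadoop method, and the error messages are illustrative.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;

// Hypothetical precondition checks to run before attempting a keytab login.
class KeytabLoginChecks {

  static void validate(String principal, String keytabPath) throws IOException {
    if (principal == null || principal.isEmpty()) {
      throw new IOException("No principal supplied for keytab login");
    }
    if (keytabPath == null || keytabPath.isEmpty()) {
      throw new IOException("No keytab path supplied for principal " + principal);
    }
    File keytab = new File(keytabPath);
    if (!keytab.isFile()) {
      throw new FileNotFoundException(
          "Keytab not found or not a file: " + keytab.getAbsolutePath());
    }
    if (keytab.length() == 0) {
      throw new IOException("Keytab is empty: " + keytab.getAbsolutePath());
    }
  }
}
```

Each failure names the specific problem, so a missing principal is reported as such instead of surfacing later as a generic login failure.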

          stevel@apache.org Steve Loughran added a comment -

          ZOOKEEPER-2344 would benefit from some of this

          stevel@apache.org Steve Loughran added a comment -

          Some access (package-scoped at least) to User and its LoginContext would help debug that.

          stevel@apache.org Steve Loughran added a comment -

          the isLoginKeytabBased and isLoginTicketBased calls could be exported by the UGI instances; currently they are static and hard-coded to the login user

          stevel@apache.org Steve Loughran added a comment -

          diagnostics should look up hadoop.kerberos.kinit.command, verify it is present, and list its details
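
A sketch of that lookup, under a few assumptions: the configured value is passed in as a plain string (in Hadoop it would come from the configuration, and `kinit` is assumed to be the usual fallback), a relative command is resolved against a PATH-style string, and `KinitDiagnostics` with its constant and methods is hypothetical.

```java
import java.io.File;

// Hypothetical diagnostics helper; only the key name comes from Hadoop.
class KinitDiagnostics {
  static final String KINIT_COMMAND_KEY = "hadoop.kerberos.kinit.command";

  /** Resolves the command to a file, or returns null if it cannot be found. */
  static File locate(String command, String path) {
    File direct = new File(command);
    if (direct.isAbsolute()) {
      return direct.isFile() ? direct : null;
    }
    if (path == null) {
      return null;
    }
    for (String dir : path.split(File.pathSeparator)) {
      if (dir.isEmpty()) {
        continue;
      }
      File candidate = new File(dir, command);
      if (candidate.isFile()) {
        return candidate;
      }
    }
    return null;
  }

  /** One-line report: resolved location, size, and whether it is executable. */
  static String describe(String command, String path) {
    File f = locate(command, path);
    if (f == null) {
      return KINIT_COMMAND_KEY + " = " + command + " : not found";
    }
    return KINIT_COMMAND_KEY + " = " + command + " : " + f.getAbsolutePath()
        + " (" + f.length() + " bytes, executable=" + f.canExecute() + ")";
  }
}
```

Typical use would be `KinitDiagnostics.describe(configuredCommand, System.getenv("PATH"))`, printed alongside the rest of the diagnostics output.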

          stevel@apache.org Steve Loughran added a comment -

          note also that having a specific kerberos subclass of IOE means that retry handlers can bail out on a kerberos problem. There is no point retrying on a kerberos failure, as it isn't going to go away
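
To illustrate the point, here is a sketch of a retry loop that fails fast on a Kerberos-specific exception. `KerberosAuthSketchException` and `RetrySketch` are hypothetical names chosen for this example; Hadoop's actual retry policies are structured differently.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical Kerberos-specific subclass of IOException.
class KerberosAuthSketchException extends IOException {
  KerberosAuthSketchException(String message, Throwable cause) {
    super(message, cause);
  }
}

class RetrySketch {

  /** Retries on ordinary IOExceptions, but fails fast on Kerberos failures. */
  static <T> T callWithRetries(Callable<T> action, int maxAttempts) throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return action.call();
      } catch (KerberosAuthSketchException e) {
        // Retrying will not fix an expired ticket or a bad keytab.
        throw e;
      } catch (IOException e) {
        // Possibly transient; remember it and try again.
        last = e;
      }
    }
    throw last;
  }
}
```

Because the subclass is still an IOException, existing callers that catch IOException keep working; only handlers that care about the distinction need to change.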

          stevel@apache.org Steve Loughran added a comment -

          SLIDER-1035 is where I'm implementing the core of the diagnostics, with a plan to move over

          stevel@apache.org Steve Loughran added a comment -

          + make "hadoop.kerberos.kinit.command" a string constant rather than an inline string literal in UGI

          stevel@apache.org Steve Loughran added a comment -

          + add an option to run the kdiag diagnostics code during container launch? Tricky as it's really something the launched app would need to do, not the NM.

          stevel@apache.org Steve Loughran added a comment -

          YARN-4629 starts to add some yarn-client code to help manage credential setup

          stevel@apache.org Steve Loughran added a comment -

          HADOOP-12906 creates a false alarm about auth; it's really a 404

          xiaochen Xiao Chen added a comment -

          Linking HADOOP-13590 here which does #4 of the description. Reviews appreciated.

          drankye Kai Zheng added a comment -

          So many good things ... I wish to be able to spend some time on these. Would give HADOOP-13590 a look.

          stevel@apache.org Steve Loughran added a comment -

          We're collecting patches for work under here? Any volunteers for reviewing them?


            People

            • Assignee: Unassigned
            • Reporter: stevel@apache.org Steve Loughran
            • Votes: 1
            • Watchers: 25
