  1. Hadoop HDFS
  2. HDFS-16165

Backport the Hadoop 3.x Kerberos synchronization fix to Hadoop 2.x


Details

    • Patch

    Description

      Problem Description

      For more than a year Apache Kafka Connect users have been running into a Kerberos renewal issue that causes our HDFS2 connectors to fail.

      We can consistently reproduce the issue under high load with 40 connectors (threads) that use the library. When we instead try an alternate workaround that uses the Kerberos keytab on the system, the connector operates without issues.

      We identified the root cause to be a race condition bug in the Hadoop 2.x library that causes the ticket renewal to fail with the error below:

      Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
       at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)

      We reached this conclusion about the root cause after running the same environment (40 connectors) with Hadoop 3.x: our HDFS3 connectors operated without renewal issues. Finding that the synchronization issue has been fixed in the newer Hadoop 3.x releases further confirmed our hypothesis.

      Request
      

      There are many changes in HDFS 3 UserGroupInformation.java related to UGI synchronization, made as part of https://issues.apache.org/jira/browse/HADOOP-9747. Those changes suggest that race conditions were occurring in older versions, i.e. HDFS 2.x, which would explain why we can reproduce the problem with HDFS2.
      For example (among others):

        private void relogin(HadoopLoginContext login, boolean ignoreLastLoginTime)
            throws IOException {
          // ensure the relogin is atomic to avoid leaving credentials in an
          // inconsistent state.  prevents other ugi instances, SASL, and SPNEGO
          // from accessing or altering credentials during the relogin.
          synchronized(login.getSubjectLock()) {
            // another racing thread may have beat us to the relogin.
            if (login == getLogin()) {
              unprotectedRelogin(login, ignoreLastLoginTime);
            }
          }
        }
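
      The guard in the excerpt above can be illustrated with a small, self-contained toy model. This is only a sketch under stated assumptions: ReloginSketch, Login, reloginCount and raceOnce are hypothetical names invented for illustration and are not Hadoop classes or methods, and the "relogin" here merely swaps an in-memory object instead of contacting a KDC. It shows the idea of taking a shared lock and re-checking that the login is still the current one, so that 40 racing threads result in a single relogin.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model of a guarded relogin; none of these types are Hadoop classes. */
final class ReloginSketch {

    /** Hypothetical stand-in for a login context holding one ticket. */
    static final class Login {
        final int generation;
        Login(int generation) { this.generation = generation; }
    }

    private final Object subjectLock = new Object(); // role of login.getSubjectLock()
    private volatile Login current = new Login(0);   // role of getLogin()
    final AtomicInteger reloginCount = new AtomicInteger();

    Login getLogin() { return current; }

    /** Replaces the login only if 'seen' is still the current one. */
    void relogin(Login seen) {
        synchronized (subjectLock) {
            // Another racing thread may already have replaced 'seen'.
            if (seen == current) {
                current = new Login(seen.generation + 1);
                reloginCount.incrementAndGet();
            }
        }
    }

    /** Lets 'threads' workers observe the same login and race to renew it. */
    static int raceOnce(int threads) {
        ReloginSketch ugi = new ReloginSketch();
        Login expired = ugi.getLogin();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                try {
                    start.await();        // release all workers at once
                    ugi.relogin(expired); // every worker holds the same stale login
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        start.countDown();
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return ugi.reloginCount.get();
    }

    public static void main(String[] args) {
        System.out.println("relogins performed: " + raceOnce(40)); // prints 1
    }
}
```

      In this toy model, dropping either the synchronized block or the identity check would let several threads replace the login concurrently, which is the kind of inconsistency the comments in the Hadoop 3.x excerpt describe.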
      

      None of those changes were backported to Hadoop 2.x (our HDFS2 connector uses 2.10.1), on which several CDH distributions are based.

      Request
      We would like to ask for the synchronization fix to be backported to Hadoop 2.x so that our users can operate without issues.

      Impact
      The older Hadoop 2.x version is used by our HDFS connector, which our community runs in production. Currently, the issue causes the HDFS connector to fail because it cannot recover and renew the ticket at a later point. A backported fix would let our users operate without the manual intervention that is now required every week (or every few days in some cases). The only workaround available to the community is to run a command or restart their workers.

          People

            Assignee: Unassigned
            Reporter: Daniel Osvath (dosvath)