[HDFS-16165] Backport the Hadoop 3.x Kerberos synchronization fix to Hadoop 2.x - ASF JIRA

Details

Type: Wish
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels:
- Confluent
Environment:

Can be reproduced in docker HDFS environment with Kerberos https://github.com/vdesabou/kafka-docker-playground/blob/93a93de293ad2f9bb22afb244f2d8729a178296e/connect/connect-hdfs2-sink/hdfs2-sink-ha-kerberos-repro-gss-exception.sh


Target Version/s:

2.10.3
Flags:

Patch

Description

Problem Description

For more than a year, Apache Kafka Connect users have been running into a Kerberos renewal issue that causes our HDFS2 connectors to fail.

We have been able to consistently reproduce the issue under high load with 40 connectors (threads) that use the library. When we instead use a workaround that relies on the Kerberos keytab on the system, the connector operates without issues.

We identified the root cause to be a race condition bug in the Hadoop 2.x library that causes the ticket renewal to fail with the error below:

Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
 at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)

We reached this conclusion after running the same environment (40 connectors) with Hadoop 3.x and our HDFS3 connectors, which operated without renewal issues. Additionally, finding that the synchronization issue has been fixed in the newer Hadoop 3.x releases confirmed our hypothesis about the root cause.

There are many changes in HDFS 3 UserGroupInformation.java related to UGI synchronization, which were made as part of https://issues.apache.org/jira/browse/HADOOP-9747. Those changes suggest that race conditions were present in the older version, i.e. HDFS 2.x, which would explain why we can reproduce the problem with HDFS2.
For example (among others):

  private void relogin(HadoopLoginContext login, boolean ignoreLastLoginTime)
      throws IOException {
    // ensure the relogin is atomic to avoid leaving credentials in an
    // inconsistent state.  prevents other ugi instances, SASL, and SPNEGO
    // from accessing or altering credentials during the relogin.
    synchronized(login.getSubjectLock()) {
      // another racing thread may have beat us to the relogin.
      if (login == getLogin()) {
        unprotectedRelogin(login, ignoreLastLoginTime);
      }
    }
  }
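
To illustrate why the lock plus the identity check matters, here is a minimal, self-contained sketch of the same pattern. It does not use any Hadoop classes: LoginHandle, ReloginSketch, reloginCount and raceOnce are hypothetical names invented for this illustration, and the "relogin" is simulated by swapping in a new handle. It only shows the idea, under the assumption that all racing threads observed the same stale login.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a login context; this is NOT a Hadoop class.
final class LoginHandle {
    final Object subjectLock;
    LoginHandle(Object subjectLock) { this.subjectLock = subjectLock; }
}

// Hypothetical sketch of the "lock, then re-check the current login" pattern.
class ReloginSketch {
    private final Object subjectLock = new Object();
    private volatile LoginHandle current = new LoginHandle(subjectLock);
    final AtomicInteger reloginCount = new AtomicInteger();

    LoginHandle getLogin() { return current; }

    void relogin(LoginHandle seen) {
        // Make the relogin atomic with respect to other threads sharing the subject.
        synchronized (seen.subjectLock) {
            // Another racing thread may already have replaced the login.
            if (seen == getLogin()) {
                current = new LoginHandle(subjectLock); // simulated relogin
                reloginCount.incrementAndGet();
            }
        }
    }

    // Starts `threads` workers that all try to relogin from the same stale handle
    // and returns how many relogins were actually performed.
    static int raceOnce(int threads) throws InterruptedException {
        ReloginSketch ugi = new ReloginSketch();
        LoginHandle stale = ugi.getLogin();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    start.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                ugi.relogin(stale);
            });
            workers[i].start();
        }
        start.countDown();
        for (Thread t : workers) {
            t.join();
        }
        return ugi.reloginCount.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("relogins performed: " + raceOnce(40));
    }
}
```

With 40 threads racing on the same stale handle, only the first thread to take the lock performs the simulated relogin; the others see that the login has already changed and do nothing. Without the lock and the check, each thread could tear down and rebuild the credentials while another thread is reading them, which is the kind of inconsistent state we suspect in Hadoop 2.x.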

None of those changes were backported to Hadoop 2.x (our HDFS2 connector uses 2.10.1), on which several CDH distributions are based.

Request
We would like to ask for the synchronization fix to be backported to Hadoop 2.x so that our users can operate without issues.

Impact
The older Hadoop 2.x version is used by our HDFS connector, which is used in production by our community. Currently, the issue causes our HDFS connector to fail, as it is unable to recover and renew the ticket at a later point. Having the fix backported would allow our users to operate without issues that require manual intervention every week (or every few days in some cases). The only workaround available to the community for the issue is to run a command or restart their workers.

Attachments

Activity

People

Assignee: Unassigned

Reporter: Daniel Osvath

Votes: 0

Watchers: 4

Dates

Created: 12/Aug/21 17:55

Updated: 24/May/22 08:04