FALCON-1595: In secure cluster, Falcon server loses ability to communicate with HDFS over time


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8
    • Fix Version/s: 0.9
    • Component/s: None
    • Labels: None

    Description

      In a Kerberos-secured cluster where the Kerberos ticket validity is one day, the Falcon server eventually lost the ability to read from and write to HDFS. The logs showed typical Kerberos-related errors such as "GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)".

      2015-10-28 00:04:59,517 INFO  - [LaterunHandler:] ~ Creating FS impersonating user testUser (HadoopClientFactory:197)
      2015-10-28 00:04:59,519 WARN  - [LaterunHandler:] ~ Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] (Client:680)
      2015-10-28 00:04:59,520 WARN  - [LaterunHandler:] ~ Late Re-run failed for instance sample-process:2015-10-28T03:58Z after 420000 (AbstractRerunConsumer:84)
      java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "sample.host.com/127.0.0.1"; destination host is: "sample.host.com":8020; 
      	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
      	at org.apache.hadoop.ipc.Client.call(Client.java:1431)
      	at org.apache.hadoop.ipc.Client.call(Client.java:1358)
      	...
      Caused by: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
      	at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:685)
      	...
      

      The root cause of the issue is that the TGT can expire. The TGT must be valid at the point where Falcon accesses the NameNode or any other Kerberos-protected server, not when it calls uri.getAuthority(), which makes no network call. The best place in the code to ensure the TGT is valid is HadoopClientFactory.createFileSystem(...).
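
The ordering described above can be illustrated with a minimal, self-contained sketch. It is not the attached patch: `TgtRefreshSketch`, `Login`, `checkTgtAndRelogin`, and this `createFileSystem` signature are hypothetical stand-ins that only model ticket expiry with timestamps. In a real Hadoop client, the corresponding call would presumably be `UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab()`, assuming the server logs in from a keytab.

```java
import java.net.URI;
import java.time.Duration;
import java.time.Instant;

// Hypothetical model of "refresh the TGT right before the protected call".
final class TgtRefreshSketch {

    // Stand-in for a keytab-based Kerberos login; only tracks ticket expiry.
    static final class Login {
        private Instant expiry;
        private final Duration lifetime;
        int reloginCount = 0;

        Login(Instant expiry, Duration lifetime) {
            this.expiry = expiry;
            this.lifetime = lifetime;
        }

        boolean isValid(Instant now) {
            return now.isBefore(expiry);
        }

        // Re-login only when the ticket has expired; otherwise do nothing.
        void checkTgtAndRelogin(Instant now) {
            if (!isValid(now)) {
                expiry = now.plus(lifetime);
                reloginCount++;
            }
        }
    }

    // Models the file system factory: the ticket check happens here, just
    // before the (simulated) connection to the Kerberos-protected server.
    static String createFileSystem(URI uri, Login login, Instant now) {
        // Parsing the URI is purely local and needs no ticket.
        String authority = uri.getAuthority();

        // Ensure the ticket is valid immediately before the remote access.
        login.checkTgtAndRelogin(now);
        if (!login.isValid(now)) {
            throw new IllegalStateException("No valid credentials provided");
        }
        return "connected to " + authority;
    }
}
```

With a one-day ticket lifetime, a call made one hour after login leaves `reloginCount` at 0, while a call made 25 hours after login triggers one re-login before connecting instead of failing.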

      Attachments

        1. FALCON-1595.patch
          2 kB
          Balu Vellanki

            People

              Assignee: Balu Vellanki (bvellanki)
              Reporter: Balu Vellanki (bvellanki)