Uploaded image for project: 'Apache Storm'
  1. Apache Storm
  2. STORM-2329

Topology halts when getting HDFS writer in a secure environment

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 0.10.1, 2.0.0, 1.x
    • None
    • storm-hdfs
    • Kerberos

    Description

      Simple topologies writing to Kerberized HDFS will sometimes stop while getting a new writer (storm-hdfs) in a Kerberized environment:

      java.io.IOException: Failed on local exception: java.io.IOException: Couldn't setup connection for principal@realm to nn1/nn1IP; Host Details : local host is: "hostname/ip"; destination host is: "nn hostname":8020; at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) at org.apache.hadoop.ipc.Client.call(Client.java:1473) at org.apache.hadoop.ipc.Client.call(Client.java:1400) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy26.create(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:296) at sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy27.create(Unknown Source) at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1726) at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1668) at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1593) at org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:397) at org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:393) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:393) at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:337) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:775) at org.apache.storm.hdfs.bolt.AvroGenericRecordBolt.makeNewWriter(AvroGenericRecordBolt.java:115) at org.apache.storm.hdfs.bolt.AbstractHdfsBolt.getOrCreateWriter(AbstractHdfsBolt.java:222) at org.apache.storm.hdfs.bolt.AbstractHdfsBolt.execute(AbstractHdfsBolt.java:154) at backtype.storm.daemon.executor$fn_3697$tuple_action_fn3699.invoke(executor.clj:670) at backtype.storm.daemon.executor$mk_task_receiver$fn3620.invoke(executor.clj:426) at backtype.storm.disruptor$clojure_handler$reify3196.onEvent(disruptor.clj:58) at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125) at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99) at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80) at backtype.storm.daemon.executor$fn3697$fn3710$fn3761.invoke(executor.clj:808) at backtype.storm.util$async_loop$fn_544.invoke(util.clj:475) at clojure.lang.AFn.run(AFn.java:22) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Couldn't setup connection for principal@realm to nn1/nn1IP:8020 at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:673) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:644) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:731) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:369) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1522) at org.apache.hadoop.ipc.Client.call(Client.java:1439) ... 35 more Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211) at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:413) at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:554) at org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:369) at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:723) at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:719) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:718) ... 38 more Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt) at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147) at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:122) at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187) at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:224) at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212) at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179) at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192) ... 47 more

      Typically seen on low throughput topologies but recently witnessed in a topology that rotates files within minutes.

      From the trace it happens here: https://github.com/apache/storm/blob/master/external/storm-hdfs/src/main/java/org/apache/storm/hdfs/bolt/AbstractHdfsBolt.java#L151

      My suspicion is this happens only when opening a new file otherwise I don't see why there would be a Kerberos context to complain about until a flush/sync perhaps.

      My shoot from the hip reaction is to pull that out of the current try and simply let the bolt fail and restart to establish a security context. Thoughts?

      Attachments

        1. STORM-2329.patch
          1.0 kB
          Kristopher Kane

        Activity

          People

            Unassigned Unassigned
            kristopherkane Kristopher Kane
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated: