  1. Apache Tez
  2. TEZ-1924

Tez AM does not register with RM using its FQDN, causing jobs to fail in some environments

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.5.2
    • Fix Version/s: 0.5.4
    • Component/s: None
    • Labels:
      None
    • Target Version/s:

      Description

      Issue originally reported by [~Karam Singh].

      All OrderedWordCount, WordCount, and Tez fault-tolerance system tests failed with java.net.UnknownHostException.
      Interestingly, other Tez examples such as mrrsleep, randomwriter, randomtextwriter, sort, join_inner, join_outer, terasort, and groupbyorderbymrrtest ran fine.
      One such failure is shown below:

      RUNNING: /usr/lib/hadoop/bin/hadoop jar /usr/lib/tez/tez-mapreduce-examples-0.4.0.2.1.7.0-784.jar orderedwordcount "-DUSE_TEZ_SESSION=true" "-Dmapreduce.map.memory.mb=2048" "-Dtez.am.shuffle-vertex-manager.max-src-fraction=0" "-Dmapreduce.reduce.memory.mb=2048" "-Dmapreduce.framework.name=yarn-tez" "-Dtez.am.container.reuse.enabled=false" "-Dtez.am.log.level=DEBUG" "-Dmapreduce.map.java.opts=-Xmx1024m" "-Dtez.am.shuffle-vertex-manager.min-src-fraction=0" "-Dmapreduce.job.reduce.slowstart.completedmaps=0.01" "-Dmapreduce.reduce.java.opts=-Xmx1024m" "-Dtez.am.container.session.delay-allocation-millis=120000" /user/hrt_qa/Tez_CR_1/TestContainerReuse1 /user/hrt_qa/Tez_CROutput_1 /user/hrt_qa/Tez_CR_2/TestContainerReuse2 /user/hrt_qa/Tez_CROutput_2 -generateSplitsInClient true
      14/12/19 09:20:05 INFO impl.TimelineClientImpl: Timeline service address: http://0.0.0.0:8188/ws/v1/timeline/
      14/12/19 09:20:05 INFO client.RMProxy: Connecting to ResourceManager at headnode0.humb-tez1-ssh.d5.internal.cloudapp.net/10.0.0.87:8050
      14/12/19 09:20:05 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200
      14/12/19 09:20:06 INFO impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
      14/12/19 09:20:06 INFO impl.MetricsSystemImpl: Scheduled snapshot period at 60 second(s).
      14/12/19 09:20:06 INFO impl.MetricsSystemImpl: azure-file-system metrics system started
      14/12/19 09:20:07 INFO client.TezClientUtils: Permissions on staging directory wasb://humb-tez1@humboldttesting.blob.core.windows.net/user/hrt_qa/.staging/application_1418977790315_0016 are incorrect: rwxr-xr-x. Fixing permissions to correct value rwx------
      14/12/19 09:20:07 INFO examples.OrderedWordCount: Creating Tez Session
      14/12/19 09:20:07 INFO impl.TimelineClientImpl: Timeline service address: http://0.0.0.0:8188/ws/v1/timeline/
      14/12/19 09:20:07 INFO client.RMProxy: Connecting to ResourceManager at headnode0.humb-tez1-ssh.d5.internal.cloudapp.net/10.0.0.87:8050
      14/12/19 09:20:07 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200
      14/12/19 09:20:09 INFO impl.YarnClientImpl: Submitted application application_1418977790315_0016
      14/12/19 09:20:09 INFO examples.OrderedWordCount: Created Tez Session
      14/12/19 09:20:09 INFO examples.OrderedWordCount: Running OrderedWordCount DAG, dagIndex=1, inputPath=/user/hrt_qa/Tez_CR_1/TestContainerReuse1, outputPath=/user/hrt_qa/Tez_CROutput_1
      14/12/19 09:20:09 INFO hadoop.MRHelpers: Generating new input splits, splitsDir=wasb://humb-tez1@humboldttesting.blob.core.windows.net/user/hrt_qa/.staging/application_1418977790315_0016
      14/12/19 09:20:09 INFO input.FileInputFormat: Total input paths to process : 20
      14/12/19 09:20:09 INFO examples.OrderedWordCount: Waiting for TezSession to get into ready state
      14/12/19 09:20:14 INFO client.TezSession: Failed to retrieve AM Status via proxy
      org.apache.tez.dag.api.TezException: com.google.protobuf.ServiceException: java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "workernode1":59575; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost
      	at org.apache.tez.client.TezSession.getSessionStatus(TezSession.java:351)
      	at org.apache.tez.mapreduce.examples.OrderedWordCount.waitForTezSessionReady(OrderedWordCount.java:538)
      	at org.apache.tez.mapreduce.examples.OrderedWordCount.main(OrderedWordCount.java:461)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
      	at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
      	at org.apache.tez.mapreduce.examples.ExampleDriver.main(ExampleDriver.java:88)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
      Caused by: com.google.protobuf.ServiceException: java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "workernode1":59575; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost
      	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:216)
      	at com.sun.proxy.$Proxy24.getAMStatus(Unknown Source)
      	at org.apache.tez.client.TezSession.getSessionStatus(TezSession.java:337)
      	... 14 more
      Caused by: java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "workernode1":59575; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
      	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
      	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
      	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
      	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:742)
      	at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:400)
      	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1452)
      	at org.apache.hadoop.ipc.Client.call(Client.java:1381)
      	at org.apache.hadoop.ipc.Client.call(Client.java:1363)
      	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
      	... 16 more
      Caused by: java.net.UnknownHostException
      	... 21 more
      
      ....................
      ....................
      
      Caused by: java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "workernode1":59575; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost
      	at sun.reflect.GeneratedConstructorAccessor22.newInstance(Unknown Source)
      	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
      	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
      	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
      	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:742)
      	at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:400)
      	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1452)
      	at org.apache.hadoop.ipc.Client.call(Client.java:1381)
      	at org.apache.hadoop.ipc.Client.call(Client.java:1363)
      	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
      	... 16 more
      Caused by: java.net.UnknownHostException
      	... 21 more
      14/12/19 09:25:19 ERROR examples.OrderedWordCount: Error occurred when submitting/running DAGs
      java.lang.RuntimeException: TezSession has already shutdown
      	at org.apache.tez.mapreduce.examples.OrderedWordCount.waitForTezSessionReady(OrderedWordCount.java:540)
      	at org.apache.tez.mapreduce.examples.OrderedWordCount.main(OrderedWordCount.java:461)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
      	at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
      	at org.apache.tez.mapreduce.examples.ExampleDriver.main(ExampleDriver.java:88)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
      14/12/19 09:25:19 INFO examples.OrderedWordCount: Shutting down session
      14/12/19 09:25:19 INFO client.TezSession: Shutting down Tez Session, sessionName=OrderedWordCountSession, applicationId=application_1418977790315_0016
      14/12/19 09:25:19 INFO client.TezSession: Failed to shutdown Tez Session via proxy
      org.apache.tez.dag.api.SessionNotRunning: Application not running, applicationId=application_1418977790315_0016, yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, trackingUrl=http://headnode0.humb-tez1-ssh.d5.internal.cloudapp.net:8088/proxy/application_1418977790315_0016/A
      	at org.apache.tez.client.TezClientUtils.getSessionAMProxy(TezClientUtils.java:733)
      	at org.apache.tez.client.TezSession.stop(TezSession.java:281)
      	at org.apache.tez.mapreduce.examples.OrderedWordCount.main(OrderedWordCount.java:524)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
      	at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
      	at org.apache.tez.mapreduce.examples.ExampleDriver.main(ExampleDriver.java:88)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
      14/12/19 09:25:19 INFO client.TezSession: Could not connect to AM, killing session via YARN, sessionName=OrderedWordCountSession, applicationId=application_1418977790315_0016
      14/12/19 09:25:19 INFO impl.YarnClientImpl: Killed application application_1418977790315_0016
      java.lang.RuntimeException: TezSession has already shutdown
      	at org.apache.tez.mapreduce.examples.OrderedWordCount.waitForTezSessionReady(OrderedWordCount.java:540)
      	at org.apache.tez.mapreduce.examples.OrderedWordCount.main(OrderedWordCount.java:461)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
      	at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
      	at org.apache.tez.mapreduce.examples.ExampleDriver.main(ExampleDriver.java:88)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
      
      

      Contents of /etc/hosts are:

      127.0.0.1 localhost
      
      # The following lines are desirable for IPv6 capable hosts
      ::1 ip6-localhost ip6-loopback
      fe00::0 ip6-localnet
      ff00::0 ip6-mcastprefix
      ff02::1 ip6-allnodes
      ff02::2 ip6-allrouters
      ff02::3 ip6-allhosts
      

      The contents of resolv.conf are:

      # Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
      #     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
      nameserver 168.63.129.16
      search humb-tez1-ssh.d5.internal.cloudapp.net
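
      The trace above shows the client trying to reach the AM at the unqualified name "workernode1". As a minimal sketch only (this is not the attached patch), the following shows how a JVM process can compare its short and canonical host names and prefer a domain-qualified one when advertising its address. The class AmHostName and the helpers looksQualified and preferQualified are hypothetical names introduced for illustration; the InetAddress calls are standard JDK API. The check is deliberately crude: any name containing a dot passes, including an IP literal, which getCanonicalHostName() may return when reverse lookup fails.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class AmHostName {

    /** Crude check (assumption): a name containing a dot is treated as domain-qualified. */
    static boolean looksQualified(String host) {
        return host != null && host.indexOf('.') > 0;
    }

    /** Prefer the canonical name when it looks qualified; otherwise keep the given name. */
    static String preferQualified(String hostName, String canonicalName) {
        return looksQualified(canonicalName) ? canonicalName : hostName;
    }

    public static void main(String[] args) throws UnknownHostException {
        InetAddress local = InetAddress.getLocalHost();
        String shortName = local.getHostName();
        // getCanonicalHostName() asks the resolver for the fully qualified name;
        // it may fall back to the textual IP address if the lookup fails.
        String canonical = local.getCanonicalHostName();

        System.out.println("getHostName():          " + shortName);
        System.out.println("getCanonicalHostName(): " + canonical);
        System.out.println("name to advertise:      " + preferQualified(shortName, canonical));
    }
}
```

      Running main on an affected node prints both names side by side, which makes it easy to see whether the resolver can produce a qualified name for the host at all.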
      

        Attachments

        1. TEZ-1924.2.patch
          0.9 kB
          Ivan Mitic
        2. TEZ-20.patch
          3 kB
          Ivan Mitic

            People

            • Assignee:
              ivanmi Ivan Mitic
              Reporter:
              ivanmi Ivan Mitic