Uploaded image for project: 'Apache Tez'
  1. Apache Tez
  2. TEZ-1924

Tez AM does not register with AM with full FQDN causing jobs to fail in some environments

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.5.2
    • Fix Version/s: 0.5.4
    • Component/s: None
    • Labels:
      None
    • Target Version/s:

      Description

      Issue originally reported by [~Karam Singh].

      All OrderWordCount, WordCount and Tez tests faultTolerance system tests failed due to java.net.UnknownHostException
      Interesting other tez examples such as mrrsleep, randomwriter, randomtextwriter, sort, join_inner, join_outer, terasort, groupbyorderbymrrtest ran fine
      one such example is following

      RUNNING: /usr/lib/hadoop/bin/hadoop jar /usr/lib/tez/tez-mapreduce-examples-0.4.0.2.1.7.0-784.jar orderedwordcount "-DUSE_TEZ_SESSION=true" "-Dmapreduce.map.memory.mb=2048" "-Dtez.am.shuffle-vertex-manager.max-src-fraction=0" "-Dmapreduce.reduce.memory.mb=2048" "-Dmapreduce.framework.name=yarn-tez" "-Dtez.am.container.reuse.enabled=false" "-Dtez.am.log.level=DEBUG" "-Dmapreduce.map.java.opts=-Xmx1024m" "-Dtez.am.shuffle-vertex-manager.min-src-fraction=0" "-Dmapreduce.job.reduce.slowstart.completedmaps=0.01" "-Dmapreduce.reduce.java.opts=-Xmx1024m" "-Dtez.am.container.session.delay-allocation-millis=120000" /user/hrt_qa/Tez_CR_1/TestContainerReuse1 /user/hrt_qa/Tez_CROutput_1 /user/hrt_qa/Tez_CR_2/TestContainerReuse2 /user/hrt_qa/Tez_CROutput_2 -generateSplitsInClient true
      14/12/19 09:20:05 INFO impl.TimelineClientImpl: Timeline service address: http://0.0.0.0:8188/ws/v1/timeline/
      14/12/19 09:20:05 INFO client.RMProxy: Connecting to ResourceManager at headnode0.humb-tez1-ssh.d5.internal.cloudapp.net/10.0.0.87:8050
      14/12/19 09:20:05 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200
      14/12/19 09:20:06 INFO impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
      14/12/19 09:20:06 INFO impl.MetricsSystemImpl: Scheduled snapshot period at 60 second(s).
      14/12/19 09:20:06 INFO impl.MetricsSystemImpl: azure-file-system metrics system started
      14/12/19 09:20:07 INFO client.TezClientUtils: Permissions on staging directory wasb://humb-tez1@humboldttesting.blob.core.windows.net/user/hrt_qa/.staging/application_1418977790315_0016 are incorrect: rwxr-xr-x. Fixing permissions to correct value rwx------
      14/12/19 09:20:07 INFO examples.OrderedWordCount: Creating Tez Session
      14/12/19 09:20:07 INFO impl.TimelineClientImpl: Timeline service address: http://0.0.0.0:8188/ws/v1/timeline/
      14/12/19 09:20:07 INFO client.RMProxy: Connecting to ResourceManager at headnode0.humb-tez1-ssh.d5.internal.cloudapp.net/10.0.0.87:8050
      14/12/19 09:20:07 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200
      14/12/19 09:20:09 INFO impl.YarnClientImpl: Submitted application application_1418977790315_0016
      14/12/19 09:20:09 INFO examples.OrderedWordCount: Created Tez Session
      14/12/19 09:20:09 INFO examples.OrderedWordCount: Running OrderedWordCount DAG, dagIndex=1, inputPath=/user/hrt_qa/Tez_CR_1/TestContainerReuse1, outputPath=/user/hrt_qa/Tez_CROutput_1
      14/12/19 09:20:09 INFO hadoop.MRHelpers: Generating new input splits, splitsDir=wasb://humb-tez1@humboldttesting.blob.core.windows.net/user/hrt_qa/.staging/application_1418977790315_0016
      14/12/19 09:20:09 INFO input.FileInputFormat: Total input paths to process : 20
      14/12/19 09:20:09 INFO examples.OrderedWordCount: Waiting for TezSession to get into ready state
      14/12/19 09:20:14 INFO client.TezSession: Failed to retrieve AM Status via proxy
      org.apache.tez.dag.api.TezException: com.google.protobuf.ServiceException: java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "workernode1":59575; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost
      	at org.apache.tez.client.TezSession.getSessionStatus(TezSession.java:351)
      	at org.apache.tez.mapreduce.examples.OrderedWordCount.waitForTezSessionReady(OrderedWordCount.java:538)
      	at org.apache.tez.mapreduce.examples.OrderedWordCount.main(OrderedWordCount.java:461)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
      	at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
      	at org.apache.tez.mapreduce.examples.ExampleDriver.main(ExampleDriver.java:88)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
      Caused by: com.google.protobuf.ServiceException: java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "workernode1":59575; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost
      	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:216)
      	at com.sun.proxy.$Proxy24.getAMStatus(Unknown Source)
      	at org.apache.tez.client.TezSession.getSessionStatus(TezSession.java:337)
      	... 14 more
      Caused by: java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "workernode1":59575; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
      	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
      	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
      	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
      	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:742)
      	at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:400)
      	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1452)
      	at org.apache.hadoop.ipc.Client.call(Client.java:1381)
      	at org.apache.hadoop.ipc.Client.call(Client.java:1363)
      	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
      	... 16 more
      Caused by: java.net.UnknownHostException
      	... 21 more
      
      ....................
      ....................
      
      Caused by: java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "workernode1":59575; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost
      	at sun.reflect.GeneratedConstructorAccessor22.newInstance(Unknown Source)
      	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
      	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
      	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
      	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:742)
      	at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:400)
      	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1452)
      	at org.apache.hadoop.ipc.Client.call(Client.java:1381)
      	at org.apache.hadoop.ipc.Client.call(Client.java:1363)
      	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
      	... 16 more
      Caused by: java.net.UnknownHostException
      	... 21 more
      14/12/19 09:25:19 ERROR examples.OrderedWordCount: Error occurred when submitting/running DAGs
      java.lang.RuntimeException: TezSession has already shutdown
      	at org.apache.tez.mapreduce.examples.OrderedWordCount.waitForTezSessionReady(OrderedWordCount.java:540)
      	at org.apache.tez.mapreduce.examples.OrderedWordCount.main(OrderedWordCount.java:461)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
      	at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
      	at org.apache.tez.mapreduce.examples.ExampleDriver.main(ExampleDriver.java:88)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
      14/12/19 09:25:19 INFO examples.OrderedWordCount: Shutting down session
      14/12/19 09:25:19 INFO client.TezSession: Shutting down Tez Session, sessionName=OrderedWordCountSession, applicationId=application_1418977790315_0016
      14/12/19 09:25:19 INFO client.TezSession: Failed to shutdown Tez Session via proxy
      org.apache.tez.dag.api.SessionNotRunning: Application not running, applicationId=application_1418977790315_0016, yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, trackingUrl=http://headnode0.humb-tez1-ssh.d5.internal.cloudapp.net:8088/proxy/application_1418977790315_0016/A
      	at org.apache.tez.client.TezClientUtils.getSessionAMProxy(TezClientUtils.java:733)
      	at org.apache.tez.client.TezSession.stop(TezSession.java:281)
      	at org.apache.tez.mapreduce.examples.OrderedWordCount.main(OrderedWordCount.java:524)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
      	at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
      	at org.apache.tez.mapreduce.examples.ExampleDriver.main(ExampleDriver.java:88)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
      14/12/19 09:25:19 INFO client.TezSession: Could not connect to AM, killing session via YARN, sessionName=OrderedWordCountSession, applicationId=application_1418977790315_0016
      14/12/19 09:25:19 INFO impl.YarnClientImpl: Killed application application_1418977790315_0016
      java.lang.RuntimeException: TezSession has already shutdown
      	at org.apache.tez.mapreduce.examples.OrderedWordCount.waitForTezSessionReady(OrderedWordCount.java:540)
      	at org.apache.tez.mapreduce.examples.OrderedWordCount.main(OrderedWordCount.java:461)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
      	at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
      	at org.apache.tez.mapreduce.examples.ExampleDriver.main(ExampleDriver.java:88)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
      
      

      Contents of /etc/hosts are:

      127.0.0.1 localhost
      
      # The following lines are desirable for IPv6 capable hosts
      ::1 ip6-localhost ip6-loopback
      fe00::0 ip6-localnet
      ff00::0 ip6-mcastprefix
      ff02::1 ip6-allnodes
      ff02::2 ip6-allrouters
      ff02::3 ip6-allhosts
      

      and contents of resolv.conf are:

      # Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
      #     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
      nameserver 168.63.129.16
      search humb-tez1-ssh.d5.internal.cloudapp.net
      
      1. TEZ-1924.2.patch
        0.9 kB
        Ivan Mitic
      2. TEZ-20.patch
        3 kB
        Ivan Mitic

        Activity

        Hide
        ivanmi Ivan Mitic added a comment -

        I think I have the root cause at this point. Tez client is trying to talk to its AM, and given that AM is registered with a short host name (workernode0), Tez client is failing to talk to it. If Tez AM registered with the RM using a FQDN we would not have this problem.

        Show
        ivanmi Ivan Mitic added a comment - I think I have the root cause at this point. Tez client is trying to talk to its AM, and given that AM is registered with a short host name (workernode0), Tez client is failing to talk to it. If Tez AM registered with the RM using a FQDN we would not have this problem.
        Hide
        ivanmi Ivan Mitic added a comment -

        Attaching the patch.

        Patch is modeled based on what MRv2 AM is doing. Basically, Tez AM should use NodeManager's supplied hostname when it registers with the RM.

        Show
        ivanmi Ivan Mitic added a comment - Attaching the patch. Patch is modeled based on what MRv2 AM is doing. Basically, Tez AM should use NodeManager's supplied hostname when it registers with the RM.
        Hide
        hitesh Hitesh Shah added a comment -

        Thanks for filing the issue Ivan Mitic and also for providing a patch.

        Some general comments:

        • It is usually better if the patch file is named the same as the jira ( with a version number for multiple iterations on the patch ).
        • With respect to using the NM hostname, would it be better to extract the FQDN from the server object itself if possible?
        Show
        hitesh Hitesh Shah added a comment - Thanks for filing the issue Ivan Mitic and also for providing a patch. Some general comments: It is usually better if the patch file is named the same as the jira ( with a version number for multiple iterations on the patch ). With respect to using the NM hostname, would it be better to extract the FQDN from the server object itself if possible?
        Hide
        ivanmi Ivan Mitic added a comment -

        Thanks Hitesh for the quick review!

        Attaching the updated addressing your comments.

        Show
        ivanmi Ivan Mitic added a comment - Thanks Hitesh for the quick review! Attaching the updated addressing your comments.
        Hide
        hitesh Hitesh Shah added a comment -

        +1. Looks good. Committing shortly.

        Show
        hitesh Hitesh Shah added a comment - +1. Looks good. Committing shortly.
        Hide
        hitesh Hitesh Shah added a comment -

        Committed to master, branch 0.5 and branch 0.6. Thanks for your contribution Ivan Mitic

        Show
        hitesh Hitesh Shah added a comment - Committed to master, branch 0.5 and branch 0.6. Thanks for your contribution Ivan Mitic
        Hide
        ivanmi Ivan Mitic added a comment -

        Thanks for the quick turnaround Hitesh Shah!

        Show
        ivanmi Ivan Mitic added a comment - Thanks for the quick turnaround Hitesh Shah !
        Hide
        hitesh Hitesh Shah added a comment -

        Closing issue as 0.5.4, 0.6.1 and 0.7.0 have been released.

        Show
        hitesh Hitesh Shah added a comment - Closing issue as 0.5.4, 0.6.1 and 0.7.0 have been released.

          People

          • Assignee:
            ivanmi Ivan Mitic
            Reporter:
            ivanmi Ivan Mitic
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development