Apache Tez / TEZ-2630

TezChild receives IP address instead of FQDN

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.5.0, 0.6.0, 0.7.0
    • Fix Version/s: 0.8.0-alpha, 0.7.1, 0.5.5, 0.6.3
    • Component/s: None
    • Labels: None

    Description

      I am running a YARN cluster on AWS. The slave nodes (NMs) are all configured to listen on their private DNS names. For example, one NodeManager listens on ip-10-16-141-168.ec2.internal:8042.

      When I try to run a Tez job (even a simple one like select count from nation), it fails because the child tasks are unable to connect to the AM. They try to connect to the IP address instead of the private DNS name. Here are sample log lines (a couple of them were added by me for debugging):

      2015-07-21 17:08:21,919 INFO [main] task.TezChild: TezChild starting
      2015-07-21 17:08:22,310 INFO [main] task.TezChild: Using socket factory class: org.apache.hadoop.net.StandardSocketFactory
      2015-07-21 17:08:22,336 INFO [main] task.TezChild: PID, containerIdentifier:  3699, container_1437498369268_0001_01_000002
      2015-07-21 17:08:22,418 INFO [main] Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
      2015-07-21 17:08:23,025 INFO [main] task.TezChild: Got host:port: 10.16.141.168:37949
      2015-07-21 17:08:23,035 INFO [main] task.TezChild: address variables: 10.16.141.168:37949
      2015-07-21 17:08:23,143 INFO [TezChild] task.ContainerReporter: Attempting to fetch new task
      2015-07-21 17:08:24,201 INFO [TezChild] ipc.Client: Retrying connect to server: 10.16.141.168/10.16.141.168:37949. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-07-21 17:08:25,202 INFO [TezChild] ipc.Client: Retrying connect to server: 10.16.141.168/10.16.141.168:37949. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-07-21 17:08:26,757 INFO [TezChild] ipc.Client: Retrying connect to server: 10.16.141.168/10.16.141.168:37949. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      2015-07-21 17:08:27,758 INFO [TezChild] ipc.Client: Retrying connect to server: 10.16.141.168/10.16.141.168:37949. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
      

      The AM is listening at the right address, but TezChild receives the IP address instead of the private DNS name.

      AM logs:

      2015-07-21 18:09:27,906 INFO [ServiceThread:org.apache.tez.dag.app.TaskAttemptListenerImpTezDag] app.TaskAttemptListenerImpTezDag: Listening at address: ip-10-234-2-80.ec2.internal:49967
      

      TezChild logs:

      2015-07-21 18:09:35,353 INFO [main] task.TezChild: TezChild starting
      2015-07-21 18:09:35,379 INFO [main] task.TezChild: Args: 10.234.2.80,49967,container_1437501941642_0001_01_000002,application_1437501941642_0001,1
      2015-07-21 18:09:35,770 INFO [main] task.TezChild: Using socket factory class: org.apache.hadoop.net.StandardSocketFactory
      2015-07-21 18:09:35,785 INFO [main] task.TezChild: PID, containerIdentifier:  8670, container_1437501941642_0001_01_000002
      2015-07-21 18:09:35,864 INFO [main] Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
      2015-07-21 18:09:36,403 INFO [main] task.TezChild: Got host:port: 10.234.2.80:49967
      2015-07-21 18:09:36,413 INFO [main] task.TezChild: address variables: 10.234.2.80:49967
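
To illustrate the mismatch in the logs above, here is a minimal Java sketch of how one socket address can be rendered either as an IP or as a host name, depending on which accessor is used. It is not the Tez code and not the attached patch; the class and method names (AmAddressSketch, sampleAmAddress, hostAsIp, hostAsName) are made up for this example, and the address is built by hand so that no DNS lookup is needed.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

// Hypothetical illustration only; none of these names exist in Tez.
class AmAddressSketch {

    // Builds an address carrying both a host name and an IP.
    // InetAddress.getByAddress(String, byte[]) performs no name-service lookup.
    static InetSocketAddress sampleAmAddress() {
        byte[] ip = {10, (byte) 234, 2, 80};
        try {
            InetAddress addr =
                InetAddress.getByAddress("ip-10-234-2-80.ec2.internal", ip);
            return new InetSocketAddress(addr, 49967);
        } catch (UnknownHostException e) {
            // Only thrown for an IP byte array of illegal length.
            throw new IllegalStateException(e);
        }
    }

    // Renders the host as a numeric IP, the form TezChild logs as received.
    static String hostAsIp(InetSocketAddress a) {
        return a.getAddress().getHostAddress();
    }

    // Renders the host as a name, the form the AM logs as its listen address.
    static String hostAsName(InetSocketAddress a) {
        return a.getAddress().getHostName();
    }

    public static void main(String[] args) {
        InetSocketAddress am = sampleAmAddress();
        // prints "IP form:   10.234.2.80:49967"
        System.out.println("IP form:   " + hostAsIp(am) + ":" + am.getPort());
        // prints "Name form: ip-10-234-2-80.ec2.internal:49967"
        System.out.println("Name form: " + hostAsName(am) + ":" + am.getPort());
    }
}
```

If the launcher builds the child's arguments from the IP form, the child will dial the IP even though the AM advertises the name. That matches the symptom above, though where exactly this happens in Tez is an assumption here.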
      

      Attachments

        1. TEZ-2630.patch
          0.8 kB
          Rajat Jain
        2. TEZ-2630.2.patch
          0.9 kB
          Hitesh Shah
        3. TEZ-2630.3.patch
          2 kB
          Hitesh Shah

          People

            Assignee: Hitesh Shah (hitesh)
            Reporter: Rajat Jain (rajatj)