FLINK-7540: Akka hostnames are not normalised consistently

Details

    • Patch, Important

    Description

      In NetUtils.unresolvedHostToNormalizedString() we lowercase hostnames, but Akka seems to preserve the original upper/lower case when starting the actor system. This leads to problems because other components (for example JobManagerRetriever) cannot find the actor, which results in a nonfunctional cluster.

      Original Issue Text

      Hostnames in my Hadoop cluster look like these: “DSJ-RTB-4T-177”, “DSJ-signal-900G-71”
      When running either of the following commands:
      ./bin/flink run -m yarn-cluster -yn 1 -yqu xl_trip -yjm 1024 ~/flink-1.3.1/examples/batch/WordCount.jar --input /user/all_trip_dev/test/testcount.txt --output /user/all_trip_dev/test/result
      Or
      ./bin/yarn-session.sh -d -jm 6144 -tm 12288 -qu xl_trip -s 24 -n 5 -nm "flink-YarnSession-jm6144-tm12288-s24-n5-xl_trip"
      the command line interface reports the following exceptions:

      java.lang.RuntimeException: Unable to get ClusterClient status from Application Client
      at org.apache.flink.yarn.YarnClusterClient.getClusterStatus(YarnClusterClient.java:243)

      Caused by: org.apache.flink.util.FlinkException: Could not connect to the leading JobManager. Please check that the JobManager is running.

      The job then fails. Starting the yarn-session fails in the same way.

      The application log shows these exceptions:
      2017-08-10 17:36:10,334 WARN org.apache.flink.runtime.webmonitor.JobManagerRetriever - Failed to retrieve leader gateway and port.
      akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://flink@dsj-signal-4t-248:65082/), Path(/user/jobmanager)]

      2017-08-10 17:36:10,837 ERROR org.apache.flink.yarn.YarnFlinkResourceManager - Resource manager could not register at JobManager
      akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@dsj-signal-4t-248:65082/), Path(/user/jobmanager)]] after [10000 ms]

      I also found that the hostname differs between the actor system addresses in the log:
      2017-08-10 17:35:56,791 INFO org.apache.flink.yarn.YarnJobManager - Starting JobManager at akka.tcp://flink@DSJ-signal-4T-248:65082/user/jobmanager.
      2017-08-10 17:35:56,880 INFO org.apache.flink.yarn.YarnJobManager - JobManager akka.tcp://flink@DSJ-signal-4T-248:65082/user/jobmanager was granted leadership with leader session ID Some(00000000-0000-0000-0000-000000000000).
      2017-08-10 17:36:00,312 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Web frontend listening at 0:0:0:0:0:0:0:0:54921
      2017-08-10 17:36:00,312 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Starting with JobManager akka.tcp://flink@DSJ-signal-4T-248:65082/user/jobmanager on port 54921
      2017-08-10 17:36:00,313 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink@dsj-signal-4t-248:65082/user/jobmanager:00000000-0000-0000-0000-000000000000.

      The JobManager runs at “akka.tcp://flink@DSJ-signal-4T-248:65082”, but the JobManagerRetriever looks it up at “akka.tcp://flink@dsj-signal-4t-248:65082”.
      The hostname in the JobManagerRetriever’s address is lowercase.
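The mismatch described above can be reproduced in isolation. The following is a minimal sketch, assuming that the actor lookup treats the host part of the address as an exact, case-sensitive string (the logs suggest this, but the issue does not confirm it). The class HostCaseDemo and its methods are hypothetical names for illustration; normalize only mirrors the trim().toLowerCase() step, and jobManagerUrl only rebuilds the address format seen in the logs.

```java
public class HostCaseDemo {

    // Hypothetical stand-in for the lowercasing step applied to hostnames.
    public static String normalize(String host) {
        return host.trim().toLowerCase();
    }

    // Rebuilds an address string in the format that appears in the logs.
    public static String jobManagerUrl(String host, int port) {
        return "akka.tcp://flink@" + host + ":" + port + "/user/jobmanager";
    }

    public static void main(String[] args) {
        String host = "DSJ-signal-4T-248";

        // Address under which the JobManager was started (case preserved).
        String started = jobManagerUrl(host, 65082);
        // Address used for the lookup (hostname lowercased first).
        String lookedUp = jobManagerUrl(normalize(host), 65082);

        System.out.println(started);
        System.out.println(lookedUp);
        // An exact comparison treats the two addresses as different.
        System.out.println("exact match: " + started.equals(lookedUp));
        // They differ only in the case of the hostname.
        System.out.println("match ignoring case: " + started.equalsIgnoreCase(lookedUp));
    }
}
```

If the lookup does compare addresses exactly, the two strings never match, which is consistent with the ActorNotFound and AskTimeoutException entries in the application log.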

      I then read the source code.
      In class NetUtils, the method unresolvedHostToNormalizedString(String host) at line 127 reads:
      public static String unresolvedHostToNormalizedString(String host) {
          // Return loopback interface address if host is null
          // This represents the behavior of {@code InetAddress.getByName } and RFC 3330
          if (host == null) {
              host = InetAddress.getLoopbackAddress().getHostAddress();
          } else {
              host = host.trim().toLowerCase();
          }
          ...
      }

      It converts the hostname to lowercase.
      Therefore, JobManagerRetriever cannot find the JobManager's actor system.
      I then removed the call to the toLowerCase() method in the source code.

      After that change, I could submit a job in yarn-cluster mode and start a yarn-session.
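The workaround described above can be sketched as a case-preserving variant of the normalization step. This is only an illustration of the reporter's local change, not necessarily the fix adopted in Flink; the class CasePreservingHost and the method normalizePreservingCase are hypothetical names, and the rest of the original method (elided in the issue text) is not reproduced here.

```java
import java.net.InetAddress;

public class CasePreservingHost {

    // Hypothetical variant of the normalization step that keeps the
    // original case of the hostname and only trims whitespace.
    public static String normalizePreservingCase(String host) {
        if (host == null) {
            // Same null handling as in the quoted method: fall back to loopback.
            return InetAddress.getLoopbackAddress().getHostAddress();
        }
        return host.trim();
    }

    public static void main(String[] args) {
        // The hostname keeps its upper-case letters.
        System.out.println(normalizePreservingCase("  DSJ-signal-4T-248 "));
        // A null host falls back to the loopback address.
        System.out.println(normalizePreservingCase(null));
    }
}
```

An alternative design would be to keep the lowercasing and make sure the same normalized hostname is also used when the actor system is started, so that both sides agree. Either way, the point is that the address used for starting the actor and the address used for the lookup go through the same normalization.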

            People

              trohrmann Till Rohrmann
              oty5081 Tong Yan Ou
              Votes: 0
              Watchers: 4

                Time Tracking

                  Estimated: 336h
                  Remaining: 336h
                  Logged: Not Specified