Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0
    • Component/s: Cluster Management, Mesos
    • Labels:
      None

      Description

      In certain Mesos/DCOS environments, the slave hostnames aren't resolvable. For this and other reasons, Mesos DNS names would ideally be used for communication within the Flink cluster, not the hostname discovered via `InetAddress.getLocalHost`.

      Some parts of Flink are already configurable in this respect, notably `jobmanager.rpc.address`. However, the Mesos AppMaster doesn't use that setting for everything (e.g. the artifact server); for the rest, it uses the hostname.

      Similarly, the `taskmanager.hostname` setting isn't used in Mesos deployment mode. To effectively use Mesos DNS, the TM should use `<task-name>.<framework-name>.mesos` as its hostname. This could be derived from an interpolated configuration string.

          Activity

          githubbot ASF GitHub Bot added a comment -

          GitHub user vijikarthi opened a pull request:

          https://github.com/apache/flink/pull/3692

          FLINK-5974 Added configurations to support mesos-dns hostname resolution

          This PR addresses the FLINK-5974 requirements: it handles dynamic hostname resolution for the JM and TM components, particularly in deployment environments such as Mesos/DCOS.

          It covers two main pieces of functionality.

          a) Dynamic host name configuration

          Support for specifying the hostname for the JM/TM is already available through the `jobmanager.rpc.address` and `taskmanager.hostname` configurations.

          However, in a Mesos DC/OS type of environment, each task container can be looked up using a hostname alias derived from the format `<task>.<service>.mesos`, where service discovery is managed through `mesos-dns`. To support this dynamic hostname lookup, we have introduced a new configuration, `mesos.resourcemanager.tasks.hostname`, which takes the format `_TASK.<ANY_VALUE>`.

          When this property is supplied, the `_TASK` token is replaced with the `TASK_ID` of the TM container, and the resulting string is used to populate the `taskmanager.hostname` configuration.

          For example, in a DCOS setup one could supply the configuration as `-Dmesos.resourcemanager.tasks.hostname=_TASK.FRAMEWORK_NAME.mesos`, where `FRAMEWORK_NAME` could be `flink`.

          Please refer to https://docs.mesosphere.com/1.9/usage/service-discovery/mesos-dns/service-naming/#a-records for more details on how Mesos service discovery works.
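
The `_TASK` substitution described above can be sketched as follows. This is a minimal illustration and not the code from the PR: the class name `TaskHostnameSketch`, the method `deriveHostname`, and the sample task ID `taskmanager-00001` are assumptions introduced here. It also assumes that an unset template means the default hostname discovery is kept.

```java
import java.util.Optional;

/** Hypothetical helper for illustration; not Flink's actual implementation. */
final class TaskHostnameSketch {

    /** Placeholder token named in the PR description. */
    static final String TASK_TOKEN = "_TASK";

    /**
     * Derives a TaskManager hostname from a configured template.
     * Returns an empty Optional when no template is configured, so the
     * caller can keep its default hostname discovery (an assumption).
     */
    static Optional<String> deriveHostname(String template, String taskId) {
        if (template == null || template.trim().isEmpty()) {
            return Optional.empty();
        }
        // String.replace substitutes the literal token; no regex is involved.
        // A template without the token is returned unchanged.
        return Optional.of(template.trim().replace(TASK_TOKEN, taskId));
    }

    public static void main(String[] args) {
        // "taskmanager-00001" is an illustrative task ID only.
        String host = deriveHostname("_TASK.flink.mesos", "taskmanager-00001")
                .orElse("<default hostname>");
        System.out.println(host);
    }
}
```

Because the token is optional in this sketch, a fully static hostname passes through untouched, which matches the idea that everything other than the task placeholder is fixed text.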

          b) Support for running a bootstrap script prior to executing the TM startup script

          Currently, the TM boot script `mesos-taskmanager.sh` is the only script passed to the Mesos launcher for booting the TM container.

          In a DC/OS environment where service discovery is common, we need a mechanism to wait until the service discovery records are available and the hostname is actually resolvable before launching the TM boot script.

          DCOS deployment offers a way to validate and wait for the service discovery records to be available before launching the tasks. Please see the links below for more details on how it works.
          https://mesosphere.github.io/dcos-commons/developer-guide.html#task-bootstrap
          https://github.com/mesosphere/dcos-commons/blob/master/sdk/bootstrap/main.go

          To support this, we have introduced a new configuration, `mesos.resourcemanager.tasks.cmd-prefix` (for example, `mesos.resourcemanager.tasks.cmd-prefix=$FLINK_HOME/bin/bootstrap`), to provide any executable/script that should run prior to executing the TM bootstrap command.

          This feature currently works only for Docker-based images, where the bootstrap script can be pre-baked into a specific location that is then used to configure `mesos.resourcemanager.tasks.cmd-prefix`.
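
One way the prefix could be combined with the TM start script is sketched below. The PR text does not say how the two commands are joined, so chaining them with `&&` is an assumption, as are the class name `LaunchCommandSketch`, the method `buildCommand`, and the full path used for `mesos-taskmanager.sh`.

```java
/** Hypothetical helper for illustration; not Flink's actual implementation. */
final class LaunchCommandSketch {

    /**
     * Builds the shell command for a TaskManager container.
     * ASSUMPTION: the prefix and the start command are chained with "&&",
     * so the start command only runs if the prefix exits successfully.
     */
    static String buildCommand(String cmdPrefix, String startCommand) {
        if (cmdPrefix == null || cmdPrefix.trim().isEmpty()) {
            // No prefix configured: launch the start command as before.
            return startCommand;
        }
        return cmdPrefix.trim() + " && " + startCommand;
    }

    public static void main(String[] args) {
        // The script location below is assumed for the example.
        System.out.println(buildCommand(
                "$FLINK_HOME/bin/bootstrap",
                "$FLINK_HOME/bin/mesos-taskmanager.sh"));
    }
}
```

With this kind of chaining, a bootstrap step that waits for DNS records and then exits non-zero on timeout would prevent the TaskManager from starting with an unresolvable hostname.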

          While both features were built to address Mesos/DCOS deployments, the implementation is agnostic of these environments and can be used for any deployment that needs such a facility.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/vijikarthi/flink FLINK-5974-Master

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/3692.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3692


          commit aeb432dc7fe8bcdd5faa49b8ad5dfb5630ea0747
          Author: Vijay Srinivasaraghavan <vijayaraghavan.srinivasaraghavan@emc.com>
          Date: 2017-04-06T16:48:39Z

          FLINK-5974 Added configurations to support mesos-dns hostname resolution


          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/3692

          Looks good to me. Thanks @EronWright for also checking this out...

          The only minor request I have is to use something like `MesosConfigOptions` to store the config keys. We are trying to move away from purely String-based config constants. Have a look at `YarnConfigOptions` for an example.
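
The difference between bare string constants and typed option objects can be sketched as follows. `TypedOption` and `MesosOptionsSketch` are hypothetical stand-ins written for this example and are not Flink's `ConfigOption` API. The key names come from the PR description, and the `null` defaults are an assumption.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a typed option; NOT Flink's ConfigOption class. */
final class TypedOption<T> {
    final String key;
    final T defaultValue;

    TypedOption(String key, T defaultValue) {
        this.key = key;
        this.defaultValue = defaultValue;
    }
}

/** Groups the option definitions in one place instead of scattering key strings. */
final class MesosOptionsSketch {
    // Key names are taken from the PR text; null defaults are assumed.
    static final TypedOption<String> TASKS_HOSTNAME =
            new TypedOption<>("mesos.resourcemanager.tasks.hostname", null);
    static final TypedOption<String> TASKS_CMD_PREFIX =
            new TypedOption<>("mesos.resourcemanager.tasks.cmd-prefix", null);

    /** Looks up an option, falling back to its declared default. */
    static String get(Map<String, String> conf, TypedOption<String> option) {
        String value = conf.get(option.key);
        return value != null ? value : option.defaultValue;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("mesos.resourcemanager.tasks.hostname", "_TASK.flink.mesos");
        System.out.println(get(conf, TASKS_HOSTNAME));
        System.out.println(get(conf, TASKS_CMD_PREFIX));
    }
}
```

Keeping the key and its default together in one object means callers reference the option rather than retyping the key string.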

          githubbot ASF GitHub Bot added a comment -

          Github user EronWright commented on the issue:

          https://github.com/apache/flink/pull/3692

          @StephanEwen in this case, the [MesosTaskManagerParameters](https://github.com/vijikarthi/flink/blob/aeb432dc7fe8bcdd5faa49b8ad5dfb5630ea0747/flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/MesosTaskManagerParameters.java#L39) is acting as the configuration class, and the various `ConfigOption`s are defined within it. Placing constants in `ConfigConstants` as well is now unnecessary and discouraged, and I feel that's the (minor) issue here.

          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/3692

          Sounds good.

          I can do a final pass and merge this later this week...

          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/3692

          Thanks for your contribution, @vijikarthi, and thanks to @EronWright and @StephanEwen for the review work. I will address @StephanEwen's comments and then merge this PR.

          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/3692

          Why do we actually make `TaskManager` hostname configurable via `mesos.resourcemanager.tasks.hostname`? Isn't the mesos-dns hostname already uniquely defined by the `TaskID` and the framework name contained in `FrameworkInfo`?

          If the `FrameworkInfo` does not contain the service name, then we should only have to specify that information. We could then rename the configuration parameter to `mesos.service-name` and concatenate `taskId`, `service-name` and `mesos`, where `taskId` is retrieved from the `TaskID` instance.

          githubbot ASF GitHub Bot added a comment -

          Github user EronWright commented on the issue:

          https://github.com/apache/flink/pull/3692

          @tillrohrmann Mesos DNS is not the only DNS solution for Mesos; it is merely the DCOS solution. By using an interpolated string, the name is fully configurable.

          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/3692

          Alright, this makes sense. The common denominator is that every DNS name will have an optional task placeholder (`_TASK`) and everything else is static, right?

          githubbot ASF GitHub Bot added a comment -

          Github user vijikarthi commented on the issue:

          https://github.com/apache/flink/pull/3692

          @tillrohrmann Please let me know if you are waiting for any clarifications before this PR could be merged.

          till.rohrmann Till Rohrmann added a comment -

          Added via d7364fffbf552aed79e537a7aec3af593cb4e159

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flink/pull/3692


            People

            • Assignee:
              vijikarthi Vijay Srinivasaraghavan
              Reporter:
              eronwright Eron Wright
            • Votes:
              0
              Watchers:
              3