Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2.0
    • Component/s: Cluster Management, Mesos
    • Labels:
      None

      Description

      Some users have asked for Flink to be integrated with Mesos.

      There is also a pending pull request that adds Mesos support to Flink: https://github.com/apache/flink/pull/251

      Update (May '16): a new effort is now underway, building on the recent ResourceManager work.

      Update (Oct '16): the core functionality is in the master branch. New sub-tasks track remaining work for a first release.

      Design document: (google doc)

      1. 251.patch (888 kB, Robert Metzger)

        Issue Links

          Activity

          githubbot ASF GitHub Bot added a comment -

          Github user sgran commented on the issue:

          https://github.com/apache/flink/pull/3586

          I've closed this, opened a JIRA, and rebased the commit onto master.

          The new PR is #3744

          githubbot ASF GitHub Bot added a comment -

          Github user sgran closed the pull request at:

          https://github.com/apache/flink/pull/3586

          githubbot ASF GitHub Bot added a comment -

          Github user EronWright commented on the issue:

          https://github.com/apache/flink/pull/3586

          Thanks for this contribution, I'm glad to see Flink take advantage of Fenzo's considerable power.

          Please open a new bug for this, since FLINK-1984 is closed, assign it to yourself, and update the PR description accordingly. Mark the bug with 'Mesos' component.

          githubbot ASF GitHub Bot added a comment -

          Github user EronWright commented on a diff in the pull request:

          https://github.com/apache/flink/pull/3586#discussion_r112036109

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/MesosTaskManagerParameters.java —
          @@ -166,7 +191,36 @@ public static MesosTaskManagerParameters create(Configuration flinkConfig) {
                       cpus,
                       containerType,
                       Option.apply(imageName),
          -            containeredParameters);
          +            containeredParameters,
          +            constraints);
          +    }
          +
          +    private static List<ConstraintEvaluator> parseConstraints(String mesosConstraints) {
          +        List<ConstraintEvaluator> constraints = new ArrayList<>();
          +
          +        if (mesosConstraints != null) {
          +            for (String constraint : Arrays.asList(mesosConstraints.split(","))) {
          +                if (constraint.isEmpty()) {
          +                    continue;
          +                }
          +                final List<String> constraintList = Arrays.asList(constraint.split(":"));
          +                if (constraintList.size() != 2) {
          +                    continue;
          +                }
          +                addConstraint(constraints, constraintList);
          +            }
          +        }
          +
          +        return constraints;
          +    }
          +
          +    private static void addConstraint(List<ConstraintEvaluator> constraints, final List<String> constraintList) {
          — End diff –

          Would you mind replacing `constraintList` with a pair of arguments for the attr name and value, and rename the `addConstraint` method to `addHostAttrValueConstraint`?
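
For illustration only, the refactoring requested above might look roughly like the following sketch. It is not the code that was eventually merged. `HostAttrValue` is a hypothetical stand-in for Fenzo's `HostAttrValueConstraint`, used so that the example compiles without the Fenzo dependency, and the class name `ConstraintParsingSketch` is likewise invented.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: shows the helper taking an attribute name and value
// instead of a two-element list, under the new name from the review.
class ConstraintParsingSketch {

    /** Hypothetical stand-in for Fenzo's HostAttrValueConstraint. */
    static final class HostAttrValue {
        final String attrName;
        final String attrValue;

        HostAttrValue(String attrName, String attrValue) {
            this.attrName = attrName;
            this.attrValue = attrValue;
        }
    }

    /** Parses "name:value,name:value" and silently skips malformed entries, as the diff does. */
    static List<HostAttrValue> parseConstraints(String mesosConstraints) {
        List<HostAttrValue> constraints = new ArrayList<>();
        if (mesosConstraints == null) {
            return constraints;
        }
        for (String constraint : mesosConstraints.split(",")) {
            if (constraint.isEmpty()) {
                continue;
            }
            String[] parts = constraint.split(":");
            if (parts.length != 2) {
                continue;
            }
            addHostAttrValueConstraint(constraints, parts[0], parts[1]);
        }
        return constraints;
    }

    /** Renamed helper: explicit name and value arguments replace constraintList. */
    static void addHostAttrValueConstraint(
            List<HostAttrValue> constraints, String attrName, String attrValue) {
        constraints.add(new HostAttrValue(attrName, attrValue));
    }
}
```

With explicit arguments the helper no longer relies on its caller having checked the list size, which is presumably the motivation for the request.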

          githubbot ASF GitHub Bot added a comment -

          Github user EronWright commented on a diff in the pull request:

          https://github.com/apache/flink/pull/3586#discussion_r112036914

          — Diff: flink-mesos/src/test/java/org/apache/flink/mesos/runtime/clusterframework/MesosTaskManagerParametersTest.java —
          @@ -0,0 +1,95 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.runtime.clusterframework;
          +
          +import static org.hamcrest.Matchers.is;
          +import static org.junit.Assert.assertThat;
          +
          +import org.apache.flink.configuration.Configuration;
          +import org.junit.Test;
          +
          +import com.netflix.fenzo.ConstraintEvaluator;
          +import com.netflix.fenzo.functions.Func1;
          +import com.netflix.fenzo.plugins.HostAttrValueConstraint;
          +
          +public class MesosTaskManagerParametersTest {
          +
          +
          +    @Test
          +    public void givenTwoConstraintsInConfigShouldBeParsed() throws Exception {
          +
          +        MesosTaskManagerParameters mesosTaskManagerParameters = MesosTaskManagerParameters.create(withConfiguration("cluster:foo,az:eu-west-1"));
          +        assertThat(mesosTaskManagerParameters.constraints().size(), is(2));
          +        ConstraintEvaluator firstConstraintEvaluator = new HostAttrValueConstraint("cluster", new Func1<String, String>() {
          +            @Override
          +            public String call(String s) {
          +                return "foo";
          +            }
          +        });
          +        ConstraintEvaluator secondConstraintEvaluator = new HostAttrValueConstraint("az", new Func1<String, String>() {
          +            @Override
          +            public String call(String s) {
          +                return "foo";
          +            }
          +        });
          +        assertThat(mesosTaskManagerParameters.constraints().get(0).getName(), is(firstConstraintEvaluator.getName()));
          +        assertThat(mesosTaskManagerParameters.constraints().get(1).getName(), is(secondConstraintEvaluator.getName()));
          +
          +    }
          +
          +    @Test
          +    public void givenOneConstraintInConfigShouldBeParsed() throws Exception {
          +
          +        MesosTaskManagerParameters mesosTaskManagerParameters = MesosTaskManagerParameters.create(withConfiguration("cluster:foo"));
          +        assertThat(mesosTaskManagerParameters.constraints().size(), is(1));
          +        ConstraintEvaluator firstConstraintEvaluator = new HostAttrValueConstraint("cluster", new Func1<String, String>() {
          +            @Override
          +            public String call(String s) {
          +                return "foo";
          +            }
          +        });
          +        assertThat(mesosTaskManagerParameters.constraints().get(0).getName(), is(firstConstraintEvaluator.getName()));
          +    }
          +
          +    @Test
          +    public void givenEmptyConstraintInConfigShouldBeParsed() throws Exception {
          +
          +        MesosTaskManagerParameters mesosTaskManagerParameters = MesosTaskManagerParameters.create(withConfiguration(""));
          +        assertThat(mesosTaskManagerParameters.constraints().size(), is(0));
          +    }
          +
          +    @Test
          +    public void givenInvalidConstraintInConfigShouldBeParsed() throws Exception {
          +
          +        MesosTaskManagerParameters mesosTaskManagerParameters = MesosTaskManagerParameters.create(withConfiguration(",:,"));
          +        assertThat(mesosTaskManagerParameters.constraints().size(), is(0));
          +    }
          +
          +
          +    private static Configuration withConfiguration(final String configuration) {
          — End diff –

          This method is too specific for its name, given that the class is called `MesosTaskManagerParametersTest`. Maybe it would be better to take a varargs of `Tuple2<String,String>` to make it more generic.
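
As a rough illustration of the varargs idea, a more generic helper could accept any number of key/value pairs. The sketch below is an assumption-laden stand-in rather than the PR's code: it uses `java.util.Map.Entry` in place of Flink's `Tuple2<String,String>` and a plain `Map<String, String>` in place of Flink's `Configuration`, so that it runs without Flink on the classpath. The class name, the `pair` factory, and the keys used in the usage note are hypothetical.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: Map.Entry stands in for Tuple2<String,String> and
// Map<String, String> stands in for Flink's Configuration.
class TestConfigSketch {

    /** Convenience factory for a single key/value pair. */
    static Map.Entry<String, String> pair(String key, String value) {
        return new SimpleImmutableEntry<>(key, value);
    }

    /** Builds a configuration from any number of key/value pairs, in argument order. */
    @SafeVarargs
    static Map<String, String> withConfiguration(Map.Entry<String, String>... entries) {
        Map<String, String> config = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : entries) {
            config.put(entry.getKey(), entry.getValue());
        }
        return config;
    }
}
```

A test could then call `withConfiguration(pair("example.constraints", "cluster:foo"))`, where `example.constraints` is a placeholder key, and the same helper would serve tests for other parameters.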

          githubbot ASF GitHub Bot added a comment -

          GitHub user sgran reopened a pull request:

          https://github.com/apache/flink/pull/3586

          FLINK-1984 initial commit of hard constraint evaluator

          This is related to the work in FLINK-1984. In earlier patch sets, mesos constraints were evaluated, but that appears to have been dropped in the fenzo code, and now all constraint evaluators return null.

          This is a start of exposing the fenzo constraint system, and only exposes a minimal subset of that functionality for now, hopefully in a way that allows it to be extended by later authors.

          Signed-off-by: Stephen Gran <stephen.gran@piksel.com>

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/pikselpalette/flink add_constraints

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/3586.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3586


          commit 67026533afd15c1572ac17d75119656897b03f52
          Author: Stephen Gran <stephen.gran@piksel.com>
          Date: 2017-03-21T12:36:03Z

          initial commit of hard constraint evaluator

          Signed-off-by: Stephen Gran <stephen.gran@piksel.com>

          commit 5e9413122bfdbcb6bdec6c62fb0cf99e5450572b
          Author: Stephen Gran <stephen.gran@piksel.com>
          Date: 2017-03-22T10:27:56Z

          add license header to unit test

          Signed-off-by: Stephen Gran <stephen.gran@piksel.com>


          githubbot ASF GitHub Bot added a comment -

          Github user sgran closed the pull request at:

          https://github.com/apache/flink/pull/3586

          githubbot ASF GitHub Bot added a comment -

          GitHub user sgran opened a pull request:

          https://github.com/apache/flink/pull/3586

          FLINK-1984 initial commit of hard constraint evaluator

          This is related to the work in FLINK-1984. In earlier patch sets, mesos constraints were evaluated, but that appears to have been dropped in the fenzo code, and now all constraint evaluators return null.

          This is a start of exposing the fenzo constraint system, and only exposes a minimal subset of that functionality for now, hopefully in a way that allows it to be extended by later authors.

          Signed-off-by: Stephen Gran <stephen.gran@piksel.com>

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/pikselpalette/flink add_constraints

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/3586.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3586



          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on the issue:

          https://github.com/apache/flink/pull/2315

          Thank you so much for your contribution @EronWright! Looking forward to seeing the Mesos integration in Flink grow.

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flink/pull/2315

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on the issue:

          https://github.com/apache/flink/pull/2315

          Fixing the errors and trying to rebase your changes on top of the latest master now.

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r76621116

          — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/clusterframework/ContaineredJobManager.scala —
          @@ -0,0 +1,172 @@
          +/*
          — End diff –

          This Scala file is in /java.

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on the issue:

          https://github.com/apache/flink/pull/2315

          There are some compilation issues in the build logs:

          ```
          [ERROR] COMPILATION ERROR :
          [INFO] -------------------------------------------------------------
          [ERROR] /home/jenkins/jenkins-slave/workspace/flink-github-ci/flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/store/StandaloneMesosWorkerStore.java:[21,33] package com.google.common.collect does not exist
          [ERROR] /home/jenkins/jenkins-slave/workspace/flink-github-ci/flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/store/StandaloneMesosWorkerStore.java:[65,24] cannot find symbol
          ```

          Could you rebase to the latest master and also remove any merge commits? Thanks!

          githubbot ASF GitHub Bot added a comment -

          Github user EronWright commented on the issue:

          https://github.com/apache/flink/pull/2315

          @mxm ready

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on the issue:

          https://github.com/apache/flink/pull/2315

          Let us know when you have made the last changes. I think we could then go ahead and merge this PR. I'm getting excited about this to be finally in the master :flushed:

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75642882

          — Diff: flink-mesos/src/main/scala/org/apache/flink/mesos/scheduler/ConnectionMonitor.scala —
          @@ -0,0 +1,126 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.scheduler
          +
          +import akka.actor.{Actor, FSM, Props}
          +import grizzled.slf4j.Logger
          +import org.apache.flink.configuration.Configuration
          +import org.apache.flink.mesos.scheduler.ConnectionMonitor._
          +import org.apache.flink.mesos.scheduler.messages._
          +
          +import scala.concurrent.duration._
          +
          +/**
          + * Actively monitors the Mesos connection.
          + */
          +class ConnectionMonitor() extends Actor with FSM[FsmState, Unit] {
          +
          +  val LOG = Logger(getClass)
          +
          +  startWith(StoppedState, None)
          +
          +  when(StoppedState) {
          +    case Event(msg: Start, _) =>
          +      LOG.info(s"Connecting to Mesos...")
          +      goto(ConnectingState)
          +  }
          +
          +  when(ConnectingState, stateTimeout = CONNECT_RETRY_RATE) {
          +    case Event(msg: Stop, _) =>
          +      goto(StoppedState)
          +
          +    case Event(msg: Registered, _) =>
          +      LOG.info(s"Connected to Mesos as framework ID ${msg.frameworkId.getValue}.")
          +      LOG.debug(s" Master Info: ${msg.masterInfo}")
          +      goto(ConnectedState)
          +
          +    case Event(msg: ReRegistered, _) =>
          +      LOG.info("Reconnected to a new Mesos master.")
          +      LOG.debug(s" Master Info: ${msg.masterInfo}")
          +      goto(ConnectedState)
          +
          +    case Event(StateTimeout, _) =>
          +      LOG.warn("Unable to connect to Mesos; still trying...")
          +      stay()
          +  }
          +
          +  when(ConnectedState) {
          +    case Event(msg: Stop, _) =>
          +      goto(StoppedState)
          +
          +    case Event(msg: Disconnected, _) =>
          +      LOG.warn("Disconnected from the Mesos master. Reconnecting...")
          +      goto(ConnectingState)
          +  }
          +
          — End diff –

          The default is to just log a warning and drop the message?
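
To make the question concrete, the transitions in the quoted diff can be modelled as a plain transition function. The Java sketch below is illustrative only: the enum and class names are invented, and the final fallback branch (log a warning, keep the current state) models the behaviour the question asks about, not a confirmed property of the actor, whose handling of unmatched messages is not shown in the excerpt.

```java
// Sketch only: a hand-written model of the three states in the diff.
class ConnectionFsmSketch {

    enum State { STOPPED, CONNECTING, CONNECTED }

    enum Event { START, STOP, REGISTERED, REREGISTERED, DISCONNECTED, STATE_TIMEOUT }

    /** Returns the next state for a given state and event. */
    static State next(State state, Event event) {
        switch (state) {
            case STOPPED:
                if (event == Event.START) return State.CONNECTING;
                break;
            case CONNECTING:
                if (event == Event.STOP) return State.STOPPED;
                if (event == Event.REGISTERED || event == Event.REREGISTERED) return State.CONNECTED;
                if (event == Event.STATE_TIMEOUT) return State.CONNECTING; // keep retrying
                break;
            case CONNECTED:
                if (event == Event.STOP) return State.STOPPED;
                if (event == Event.DISCONNECTED) return State.CONNECTING;
                break;
        }
        // Assumed fallback: warn about the unmatched event and drop it.
        System.err.println("unhandled event " + event + " in state " + state);
        return state;
    }
}
```

In this model a `DISCONNECTED` event that arrives while stopped is simply dropped, which is the case the reviewer appears to be probing.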

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75642452

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/LaunchableMesosWorker.java —
          @@ -0,0 +1,205 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.runtime.clusterframework;
          +
          +import com.netflix.fenzo.ConstraintEvaluator;
          +import com.netflix.fenzo.TaskAssignmentResult;
          +import com.netflix.fenzo.TaskRequest;
          +import com.netflix.fenzo.VMTaskFitnessCalculator;
          +import org.apache.flink.configuration.Configuration;
          +import org.apache.flink.mesos.cli.FlinkMesosSessionCli;
          +import org.apache.flink.mesos.scheduler.LaunchableTask;
          +import org.apache.mesos.Protos;
          +
          +import java.util.Collections;
          +import java.util.List;
          +import java.util.Map;
          +import java.util.concurrent.atomic.AtomicReference;
          +
          +import static org.apache.flink.mesos.Utils.variable;
          +import static org.apache.flink.mesos.Utils.range;
          +import static org.apache.flink.mesos.Utils.ranges;
          +import static org.apache.flink.mesos.Utils.scalar;
          +
          +/**
          + * Specifies how to launch a Mesos worker.
          + */
          +public class LaunchableMesosWorker implements LaunchableTask {
          +
          + /**
          + * The set of configuration keys to be dynamically configured with a port allocated from Mesos.
          + */
          + private static String[] TM_PORT_KEYS = {
          + "taskmanager.rpc.port",
          + "taskmanager.data.port" };
          +
          + private final MesosTaskManagerParameters params;
          + private final Protos.TaskInfo.Builder template;
          + private final Protos.TaskID taskID;
          + private final Request taskRequest;
          +
          + /**
          + * Construct a launchable Mesos worker.
          + * @param params the TM parameters such as memory, cpu to acquire.
          + * @param template a template for the TaskInfo to be constructed at launch time.
          + * @param taskID the taskID for this worker.
          + */
          + public LaunchableMesosWorker(MesosTaskManagerParameters params, Protos.TaskInfo.Builder template, Protos.TaskID taskID) {
          + this.params = params;
          + this.template = template;
          + this.taskID = taskID;
          + this.taskRequest = new Request();
          + }
          +
          + public Protos.TaskID taskID() {
          + return taskID;
          + }
          +
          + @Override
          + public TaskRequest taskRequest() {
          + return taskRequest;
          + }
          +
          + class Request implements TaskRequest {
          + private final AtomicReference<TaskRequest.AssignedResources> assignedResources = new AtomicReference<>();
          +
          + @Override
          + public String getId() {
          + return taskID.getValue();
          + }
          +
          + @Override
          + public String taskGroupName() {
          + return "";
          + }
          +
          + @Override
          + public double getCPUs() {
          + return params.cpus();
          + }
          +
          + @Override
          + public double getMemory() {
          + return params.containeredParameters().taskManagerTotalMemoryMB();
          + }
          +
          + @Override
          + public double getNetworkMbps() {
          + return 0.0;
          + }
          +
          + @Override
          + public double getDisk() {
          + return 0.0;
          — End diff –

          So 0.0 means give me whatever? How about a minimum value, e.g. 500MB?
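
          One way to read the suggestion above is to apply a floor to the disk request instead of returning 0.0. The following is a minimal sketch, not code from the pull request: `DiskFloor`, `effectiveDiskMB`, and the 500 MB default are hypothetical names and values chosen for illustration, and it assumes disk is expressed in MB like the memory request.

```java
/**
 * Hypothetical helper (not part of Flink or Fenzo): applies a minimum
 * to a configured disk request, both expressed in MB.
 */
final class DiskFloor {

    /** Assumed default floor, taken from the "500MB" figure in the review comment. */
    static final double DEFAULT_MIN_DISK_MB = 500.0;

    private DiskFloor() {
    }

    /** Returns the configured disk size, raised to the floor if it is smaller. */
    static double effectiveDiskMB(double configuredDiskMB, double minDiskMB) {
        if (Double.isNaN(configuredDiskMB) || configuredDiskMB < 0.0) {
            throw new IllegalArgumentException("disk size must be >= 0, got " + configuredDiskMB);
        }
        return Math.max(configuredDiskMB, minDiskMB);
    }

    public static void main(String[] args) {
        // an unset (0.0) request is raised to the floor
        System.out.println(effectiveDiskMB(0.0, DEFAULT_MIN_DISK_MB));
        // a larger explicit request is kept as-is
        System.out.println(effectiveDiskMB(2048.0, DEFAULT_MIN_DISK_MB));
    }
}
```

          In `Request.getDisk()` a call like this could stand in for the hard-coded `0.0`, with the floor read from configuration. Whether a floor is desirable at all is the open question in the comment.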

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75641703

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/cli/FlinkMesosSessionCli.java —
          @@ -0,0 +1,59 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.cli;
          +
          +import com.fasterxml.jackson.core.JsonProcessingException;
          +import com.fasterxml.jackson.core.type.TypeReference;
          +import com.fasterxml.jackson.databind.ObjectMapper;
          +import org.apache.flink.configuration.Configuration;
          +
          +import java.io.IOException;
          +import java.util.Map;
          +
          +public class FlinkMesosSessionCli {
          — End diff –

          Yes, I recognized it. Sure! No problem.

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75641515

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/MesosConfigKeys.java —
          @@ -0,0 +1,44 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.runtime.clusterframework;
          +
          +/**
          + * The Mesos environment variables used for settings of the containers.
          + */
          +public class MesosConfigKeys {
          — End diff –

          Yes, a follow-up is fine.

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75641421

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/MesosFlinkResourceManager.java —
          @@ -0,0 +1,755 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.runtime.clusterframework;
          +
          +import akka.actor.ActorRef;
          +import akka.actor.Props;
          +import com.netflix.fenzo.TaskRequest;
          +import com.netflix.fenzo.TaskScheduler;
          +import com.netflix.fenzo.VirtualMachineLease;
          +import com.netflix.fenzo.functions.Action1;
          +import org.apache.flink.api.java.tuple.Tuple2;
          +import org.apache.flink.configuration.ConfigConstants;
          +import org.apache.flink.configuration.Configuration;
          +import org.apache.flink.mesos.runtime.clusterframework.store.MesosWorkerStore;
          +import org.apache.flink.mesos.scheduler.ConnectionMonitor;
          +import org.apache.flink.mesos.scheduler.LaunchableTask;
          +import org.apache.flink.mesos.scheduler.LaunchCoordinator;
          +import org.apache.flink.mesos.scheduler.ReconciliationCoordinator;
          +import org.apache.flink.mesos.scheduler.SchedulerProxy;
          +import org.apache.flink.mesos.scheduler.TaskMonitor;
          +import org.apache.flink.mesos.scheduler.TaskSchedulerBuilder;
          +import org.apache.flink.mesos.scheduler.Tasks;
          +import org.apache.flink.mesos.scheduler.messages.AcceptOffers;
          +import org.apache.flink.mesos.scheduler.messages.Disconnected;
          +import org.apache.flink.mesos.scheduler.messages.Error;
          +import org.apache.flink.mesos.scheduler.messages.OfferRescinded;
          +import org.apache.flink.mesos.scheduler.messages.ReRegistered;
          +import org.apache.flink.mesos.scheduler.messages.Registered;
          +import org.apache.flink.mesos.scheduler.messages.ResourceOffers;
          +import org.apache.flink.mesos.scheduler.messages.StatusUpdate;
          +import org.apache.flink.mesos.util.MesosConfiguration;
          +import org.apache.flink.runtime.clusterframework.ApplicationStatus;
          +import org.apache.flink.runtime.clusterframework.FlinkResourceManager;
          +import org.apache.flink.runtime.clusterframework.messages.FatalErrorOccurred;
          +import org.apache.flink.runtime.clusterframework.messages.StopCluster;
          +import org.apache.flink.runtime.clusterframework.types.ResourceID;
          +import org.apache.flink.runtime.leaderretrieval.LeaderRetrievalService;
          +import org.apache.mesos.Protos;
          +import org.apache.mesos.Protos.FrameworkInfo;
          +import org.apache.mesos.SchedulerDriver;
          +import org.slf4j.Logger;
          +import scala.Option;
          +
          +import java.util.ArrayList;
          +import java.util.Collection;
          +import java.util.HashMap;
          +import java.util.List;
          +import java.util.Map;
          +
          +import static java.util.Objects.requireNonNull;
          +
          +/**
          + * Flink Resource Manager for Apache Mesos.
          + */
          +public class MesosFlinkResourceManager extends FlinkResourceManager<RegisteredMesosWorkerNode> {
          +
          + /** The Mesos configuration (master and framework info) */
          + private final MesosConfiguration mesosConfig;
          +
          + /** The TaskManager container parameters (like container memory size) */
          + private final MesosTaskManagerParameters taskManagerParameters;
          +
          + /** Context information used to start a TaskManager Java process */
          + private final Protos.TaskInfo.Builder taskManagerLaunchContext;
          +
          + /** Number of failed Mesos tasks before stopping the application. -1 means infinite. */
          + private final int maxFailedTasks;
          +
          + /** Callback handler for the asynchronous Mesos scheduler */
          + private SchedulerProxy schedulerCallbackHandler;
          +
          + /** Mesos scheduler driver */
          + private SchedulerDriver schedulerDriver;
          +
          + private ActorRef connectionMonitor;
          +
          + private ActorRef taskRouter;
          +
          + private ActorRef launchCoordinator;
          +
          + private ActorRef reconciliationCoordinator;
          +
          + private MesosWorkerStore workerStore;
          +
          + final Map<ResourceID, MesosWorkerStore.Worker> workersInNew;
          + final Map<ResourceID, MesosWorkerStore.Worker> workersInLaunch;
          + final Map<ResourceID, MesosWorkerStore.Worker> workersBeingReturned;
          +
          + /** The number of failed tasks since the master became active */
          + private int failedTasksSoFar;
          +
          + public MesosFlinkResourceManager(
          + Configuration flinkConfig,
          + MesosConfiguration mesosConfig,
          + MesosWorkerStore workerStore,
          + LeaderRetrievalService leaderRetrievalService,
          + MesosTaskManagerParameters taskManagerParameters,
          + Protos.TaskInfo.Builder taskManagerLaunchContext,
          + int maxFailedTasks,
          + int numInitialTaskManagers) {
          +
          + super(numInitialTaskManagers, flinkConfig, leaderRetrievalService);
          +
          + this.mesosConfig = requireNonNull(mesosConfig);
          +
          + this.workerStore = requireNonNull(workerStore);
          +
          + this.taskManagerParameters = requireNonNull(taskManagerParameters);
          + this.taskManagerLaunchContext = requireNonNull(taskManagerLaunchContext);
          + this.maxFailedTasks = maxFailedTasks;
          +
          + this.workersInNew = new HashMap<>();
          + this.workersInLaunch = new HashMap<>();
          + this.workersBeingReturned = new HashMap<>();
          + }
          +
          + // ------------------------------------------------------------------------
          + // Mesos-specific behavior
          + // ------------------------------------------------------------------------
          +
          + @Override
          + protected void initialize() throws Exception {
          + LOG.info("Initializing Mesos resource master");
          +
          + workerStore.start();
          +
          + // create the scheduler driver to communicate with Mesos
          + schedulerCallbackHandler = new SchedulerProxy(self());
          +
          + // register with Mesos
          + FrameworkInfo.Builder frameworkInfo = mesosConfig.frameworkInfo()
          + .clone()
          + .setCheckpoint(true);
          +
          + Option<Protos.FrameworkID> frameworkID = workerStore.getFrameworkID();
          + if(frameworkID.isEmpty()) {
          + LOG.info("Registering as new framework.");
          + }
          + else {
          + LOG.info("Recovery scenario: re-registering using framework ID {}.", frameworkID.get().getValue());
          + frameworkInfo.setId(frameworkID.get());
          + }
          +
          + MesosConfiguration initializedMesosConfig = mesosConfig.withFrameworkInfo(frameworkInfo);
          + MesosConfiguration.logMesosConfig(LOG, initializedMesosConfig);
          + schedulerDriver = initializedMesosConfig.createDriver(schedulerCallbackHandler, false);
          +
          + // create supporting actors
          + connectionMonitor = createConnectionMonitor();
          + launchCoordinator = createLaunchCoordinator();
          + reconciliationCoordinator = createReconciliationCoordinator();
          + taskRouter = createTaskRouter();
          +
          + recoverWorkers();
          +
          + connectionMonitor.tell(new ConnectionMonitor.Start(), self());
          + schedulerDriver.start();
          + }
          +
          + protected ActorRef createConnectionMonitor() {
          + return context().actorOf(
          + ConnectionMonitor.createActorProps(ConnectionMonitor.class, config),
          + "connectionMonitor");
          + }
          +
          + protected ActorRef createTaskRouter() {
          + return context().actorOf(
          + Tasks.createActorProps(Tasks.class, config, schedulerDriver, TaskMonitor.class),
          + "tasks");
          + }
          +
          + protected ActorRef createLaunchCoordinator() {
          + return context().actorOf(
          + LaunchCoordinator.createActorProps(LaunchCoordinator.class, self(), config, schedulerDriver, createOptimizer()),
          + "launchCoordinator");
          + }
          +
          + protected ActorRef createReconciliationCoordinator() {
          + return context().actorOf(
          + ReconciliationCoordinator.createActorProps(ReconciliationCoordinator.class, config, schedulerDriver),
          + "reconciliationCoordinator");
          + }
          +
          + @Override
          + public void postStop() {
          + LOG.info("Stopping Mesos resource master");
          + super.postStop();
          + }
          +
          + // ------------------------------------------------------------------------
          + // Actor messages
          + // ------------------------------------------------------------------------
          +
          + @Override
          + protected void handleMessage(Object message) {
          +
          + // check for Mesos-specific actor messages first
          +
          + // --- messages about Mesos connection
          + if (message instanceof Registered) {
          + registered((Registered) message);
          + } else if (message instanceof ReRegistered) {
          + reregistered((ReRegistered) message);
          + } else if (message instanceof Disconnected) {
          + disconnected((Disconnected) message);
          + } else if (message instanceof Error) {
          + error(((Error) message).message());
          +
          + // --- messages about offers
          + } else if (message instanceof ResourceOffers || message instanceof OfferRescinded) {
          + launchCoordinator.tell(message, self());
          + } else if (message instanceof AcceptOffers) {
          + acceptOffers((AcceptOffers) message);
          +
          + // --- messages about tasks
          + } else if (message instanceof StatusUpdate) {
          + taskStatusUpdated((StatusUpdate) message);
          + } else if (message instanceof ReconciliationCoordinator.Reconcile) {
          + // a reconciliation request from a task
          + reconciliationCoordinator.tell(message, self());
          + } else if (message instanceof TaskMonitor.TaskTerminated) {
          + // a termination message from a task
          + TaskMonitor.TaskTerminated msg = (TaskMonitor.TaskTerminated) message;
          + taskTerminated(msg.taskID(), msg.status());
          +
          + } else {
          + // message handled by the generic resource master code
          + super.handleMessage(message);
          + }
          + }
          +
          + /**
          + * Called to shut down the cluster (not a failover situation).
          + *
          + * @param finalStatus The application status to report.
          + * @param optionalDiagnostics An optional diagnostics message.
          + */
          + @Override
          + protected void shutdownApplication(ApplicationStatus finalStatus, String optionalDiagnostics) {
          +
          + LOG.info("Shutting down and unregistering as a Mesos framework.");
          + try {
          + // unregister the framework, which implicitly removes all tasks.
          + schedulerDriver.stop(false);
          + }
          + catch(Exception ex) {
          + LOG.warn("unable to unregister the framework", ex);
          + }
          +
          + try {
          + workerStore.cleanup();
          + }
          + catch(Exception ex) {
          + LOG.warn("unable to cleanup the ZooKeeper state", ex);
          + }
          +
          + context().stop(self());
          + }
          +
          + @Override
          + protected void fatalError(String message, Throwable error) {
          + // we do not unregister, but cause a hard fail of this process, to have it
          + // restarted by the dispatcher
          + LOG.error("FATAL ERROR IN MESOS APPLICATION MASTER: " + message, error);
          + LOG.error("Shutting down process");
          +
          + // kill this process, this will make an external supervisor (the dispatcher) restart the process
          + System.exit(EXIT_CODE_FATAL_ERROR);
          + }
          +
          + // ------------------------------------------------------------------------
          + // Worker Management
          + // ------------------------------------------------------------------------
          +
          + /**
          + * Recover framework/worker information persisted by a prior incarnation of the RM.
          + */
          + private void recoverWorkers() throws Exception {
          + // if this application master starts as part of an ApplicationMaster/JobManager recovery,
          + // then some worker tasks are most likely still alive and we can re-obtain them
          + final List<MesosWorkerStore.Worker> tasksFromPreviousAttempts = workerStore.recoverWorkers();
          +
          + if (!tasksFromPreviousAttempts.isEmpty()) {
          + LOG.info("Retrieved {} TaskManagers from previous attempt", tasksFromPreviousAttempts.size());
          +
          + List<Tuple2<TaskRequest,String>> toAssign = new ArrayList<>(tasksFromPreviousAttempts.size());
          + List<LaunchableTask> toLaunch = new ArrayList<>(tasksFromPreviousAttempts.size());
          +
          + for (final MesosWorkerStore.Worker worker : tasksFromPreviousAttempts) {
          + LaunchableMesosWorker launchable = createLaunchableMesosWorker(worker.taskID());
          +
          + switch(worker.state()) {
          + case New:
          + workersInNew.put(extractResourceID(worker.taskID()), worker);
          + toLaunch.add(launchable);
          + break;
          + case Launched:
          + workersInLaunch.put(extractResourceID(worker.taskID()), worker);
          + toAssign.add(new Tuple2<>(launchable.taskRequest(), worker.hostname().get()));
          + break;
          + case Released:
          + workersBeingReturned.put(extractResourceID(worker.taskID()), worker);
          + break;
          + }
          + taskRouter.tell(new TaskMonitor.TaskGoalStateUpdated(extractGoalState(worker)), self());
          + }
          +
          + // tell the launch coordinator about prior assignments
          + if(toAssign.size() >= 1) {
          + launchCoordinator.tell(new LaunchCoordinator.Assign(toAssign), self());
          + }
          + // tell the launch coordinator to launch any new tasks
          + if(toLaunch.size() >= 1) {
          + launchCoordinator.tell(new LaunchCoordinator.Launch(toLaunch), self());
          + }
          + }
          + }
          +
          + /**
          + * Plan for some additional workers to be launched.
          + *
          + * @param numWorkers The number of workers to allocate.
          + */
          + @Override
          + protected void requestNewWorkers(int numWorkers) {
          +
          + try {
          + List<TaskMonitor.TaskGoalStateUpdated> toMonitor = new ArrayList<>(numWorkers);
          + List<LaunchableTask> toLaunch = new ArrayList<>(numWorkers);
          +
          + // generate new workers into persistent state and launch associated actors
          + for (int i = 0; i < numWorkers; i++) {
          + MesosWorkerStore.Worker worker = MesosWorkerStore.Worker.newTask(workerStore.newTaskID());
          + workerStore.putWorker(worker);
          + workersInNew.put(extractResourceID(worker.taskID()), worker);
          +
          + LaunchableMesosWorker launchable = createLaunchableMesosWorker(worker.taskID());
          +
          + LOG.info("Scheduling Mesos task {} with ({} MB, {} cpus).",
          + launchable.taskID().getValue(), launchable.taskRequest().getMemory(), launchable.taskRequest().getCPUs());
          +
          + toMonitor.add(new TaskMonitor.TaskGoalStateUpdated(extractGoalState(worker)));
          + toLaunch.add(launchable);
          + }
          +
          + // tell the task router about the new plans
          + for (TaskMonitor.TaskGoalStateUpdated update : toMonitor) {
          + taskRouter.tell(update, self());
          + }
          +
          + // tell the launch coordinator to launch the new tasks
          + if(toLaunch.size() >= 1) {
          + launchCoordinator.tell(new LaunchCoordinator.Launch(toLaunch), self());
          + }
          + }
          + catch(Exception ex) {
          + fatalError("unable to request new workers", ex);
          + }
          + }
          +
          + /**
          + * Accept offers as advised by the launch coordinator.
          + *
          + * Acceptance is routed through the RM to update the persistent state before
          + * forwarding the message to Mesos.
          + */
          + private void acceptOffers(AcceptOffers msg) {
          +
          + try {
          + List<TaskMonitor.TaskGoalStateUpdated> toMonitor = new ArrayList<>(msg.operations().size());
          +
          + // transition the persistent state of some tasks to Launched
          + for (Protos.Offer.Operation op : msg.operations()) {
          + if (op.getType() != Protos.Offer.Operation.Type.LAUNCH) {
          + continue;
          + }
          + for (Protos.TaskInfo info : op.getLaunch().getTaskInfosList()) {
          + MesosWorkerStore.Worker worker = workersInNew.remove(extractResourceID(info.getTaskId()));
          + assert (worker != null);
          +
          + worker = worker.launchTask(info.getSlaveId(), msg.hostname());
          + workerStore.putWorker(worker);
          + workersInLaunch.put(extractResourceID(worker.taskID()), worker);
          +
          + LOG.info("Launching Mesos task {} on host {}.",
          + worker.taskID().getValue(), worker.hostname().get());
          +
          + toMonitor.add(new TaskMonitor.TaskGoalStateUpdated(extractGoalState(worker)));
          + }
          + }
          +
          + // tell the task router about the new plans
          + for (TaskMonitor.TaskGoalStateUpdated update : toMonitor) {
          + taskRouter.tell(update, self());
          + }
          +
          + // send the acceptance message to Mesos
          + schedulerDriver.acceptOffers(msg.offerIds(), msg.operations(), msg.filters());
          + }
          + catch(Exception ex) {
          + fatalError("unable to accept offers", ex);
          + }
          + }
          +
          + /**
          + * Handle a task status change.
          + */
          + private void taskStatusUpdated(StatusUpdate message) {
          + taskRouter.tell(message, self());
          + reconciliationCoordinator.tell(message, self());
          + schedulerDriver.acknowledgeStatusUpdate(message.status());
          + }
          +
          + /**
          + * Accept the given started worker into the internal state.
          + *
          + * @param resourceID The worker resource id
          + * @return A registered worker node record.
          + */
          + @Override
          + protected RegisteredMesosWorkerNode workerStarted(ResourceID resourceID) {
          + MesosWorkerStore.Worker inLaunch = workersInLaunch.remove(resourceID);
          + if (inLaunch == null) {
          + // Worker was not in state "being launched", this can indicate that the TaskManager
          + // in this worker was already registered or that the container was not started
          + // by this resource manager. Simply ignore this resourceID.
          + return null;
          + }
          + return new RegisteredMesosWorkerNode(inLaunch);
          + }
          +
          + /**
          + * Accept the given registered workers into the internal state.
          + *
          + * @param toConsolidate The worker IDs known previously to the JobManager.
          + * @return A collection of registered worker node records.
          + */
          + @Override
          + protected Collection<RegisteredMesosWorkerNode> reacceptRegisteredWorkers(Collection<ResourceID> toConsolidate) {
          +
          + // we check for each task manager if we recognize its Mesos task ID
          + List<RegisteredMesosWorkerNode> accepted = new ArrayList<>(toConsolidate.size());
          + for (ResourceID resourceID : toConsolidate) {
          + MesosWorkerStore.Worker worker = workersInLaunch.remove(resourceID);
          + if (worker != null) {
          + LOG.info("Mesos worker consolidation recognizes TaskManager {}.", resourceID);
          + accepted.add(new RegisteredMesosWorkerNode(worker));
          + }
          + else {
          + if(isStarted(resourceID)) {
          + LOG.info("TaskManager {} has already been registered at the resource manager.", resourceID);
          + }
          + else {
          + LOG.info("Mesos worker consolidation does not recognize TaskManager {}.", resourceID);
          + }
          + }
          + }
          + return accepted;
          + }
          +
          + /**
          + * Release the given pending worker.
          + */
          + @Override
          + protected void releasePendingWorker(ResourceID id) {
          + MesosWorkerStore.Worker worker = workersInLaunch.remove(id);
          + if (worker != null) {
          + releaseWorker(worker);
          + } else {
          + LOG.error("Cannot find worker {} to release. Ignoring request.", id);
          + }
          + }
          +
          + /**
          + * Release the given started worker.
          + */
          + @Override
          + protected void releaseStartedWorker(RegisteredMesosWorkerNode worker) {
          + releaseWorker(worker.task());
          + }
          +
          + /**
          + * Plan for the removal of the given worker.
          + */
          + private void releaseWorker(MesosWorkerStore.Worker worker) {
          + try {
          + LOG.info("Releasing worker {}", worker.taskID());
          +
          + // update persistent state of worker to Released
          + worker = worker.releaseTask();
          + workerStore.putWorker(worker);
          + workersBeingReturned.put(extractResourceID(worker.taskID()), worker);
          + taskRouter.tell(new TaskMonitor.TaskGoalStateUpdated(extractGoalState(worker)), self());
          +
          + if (worker.hostname().isDefined()) {
          + // tell the launch coordinator that the task is being unassigned from the host, for planning purposes
          + launchCoordinator.tell(new LaunchCoordinator.Unassign(worker.taskID(), worker.hostname().get()), self());
          + }
          + }
          + catch (Exception ex) {
          + fatalError("unable to release worker", ex);
          + }
          + }
          +
          + @Override
          + protected int getNumWorkerRequestsPending() {
          + return workersInNew.size();
          + }
          +
          + @Override
          + protected int getNumWorkersPendingRegistration() {
          + return workersInLaunch.size();
          + }
          +
          + // ------------------------------------------------------------------------
          + // Callbacks from the Mesos Master
          + // ------------------------------------------------------------------------
          +
          + /**
          + * Called when connected to Mesos as a new framework.
          + */
          + private void registered(Registered message) {
          + connectionMonitor.tell(message, self());
          +
          + try {
          + workerStore.setFrameworkID(Option.apply(message.frameworkId()));
          + }
          + catch(Exception ex) {
          + fatalError("unable to store the assigned framework ID", ex);
          + return;
          + }
          +
          + launchCoordinator.tell(message, self());
          + reconciliationCoordinator.tell(message, self());
          + taskRouter.tell(message, self());
          + }
          +
          + /**
          + * Called when reconnected to Mesos following a failover event.
          + */
          + private void reregistered(ReRegistered message) {
          + connectionMonitor.tell(message, self());
          + launchCoordinator.tell(message, self());
          + reconciliationCoordinator.tell(message, self());
          + taskRouter.tell(message, self());
          + }
          +
          + /**
          + * Called when disconnected from Mesos.
          + */
          + private void disconnected(Disconnected message) {
          + connectionMonitor.tell(message, self());
          + launchCoordinator.tell(message, self());
          + reconciliationCoordinator.tell(message, self());
          + taskRouter.tell(message, self());
          + }
          +
          + /**
          + * Called when an error is reported by the scheduler callback.
          + */
          + private void error(String message) {
          + self().tell(new FatalErrorOccurred("Connection to Mesos failed", new Exception(message)), self());
          + }
          +
          + /**
          + * Invoked when a Mesos task reaches a terminal status.
          + */
          + private void taskTerminated(Protos.TaskID taskID, Protos.TaskStatus status) {
          + // this callback occurs for failed containers and for released containers alike
          +
          + final ResourceID id = extractResourceID(taskID);
          +
          + try {
          + workerStore.removeWorker(taskID);
          + }
          + catch(Exception ex) {
          + fatalError("unable to remove worker", ex);
          + return;
          + }
          +
          + // check if this is a failed task or a released task
          + if (workersBeingReturned.remove(id) != null) {
          + // regular finished worker that we released
          + LOG.info("Worker {} finished successfully with diagnostics: {}",
          + id, status.getMessage());
          + } else {
          + // failed worker, either at startup, or running
          + final MesosWorkerStore.Worker launched = workersInLaunch.remove(id);
          + if (launched != null) {
          + LOG.info("Mesos task {} failed, with a TaskManager in launch or registration. " +
          + "State: {} Reason: {} ({})", id, status.getState(), status.getReason(), status.getMessage());
          + // we will trigger re-acquiring new workers at the end
          + } else {
          + // failed registered worker
          + LOG.info("Mesos task {} failed, with a registered TaskManager. " +
          + "State: {} Reason: {} ({})", id, status.getState(), status.getReason(), status.getMessage());
          +
          + // notify the generic logic, which notifies the JobManager, etc.
          + notifyWorkerFailed(id, "Mesos task " + id + " failed. State: " + status.getState());
          + }
          +
          + // general failure logging
          + failedTasksSoFar++;
          +
          + String diagMessage = String.format("Diagnostics for task %s in state %s : " +
          + "reason=%s message=%s",
          + id, status.getState(), status.getReason(), status.getMessage());
          + sendInfoMessage(diagMessage);
          +
          + LOG.info(diagMessage);
          + LOG.info("Total number of failed tasks so far: " + failedTasksSoFar);
          +
          + // maxFailedTasks == -1 is infinite number of retries.
          + if (maxFailedTasks >= 0 && failedTasksSoFar > maxFailedTasks) {
          + String msg = "Stopping Mesos session because the number of failed tasks ("
          + + failedTasksSoFar + ") exceeded the maximum failed tasks ("
          + + maxFailedTasks + "). This number is controlled by the '"
          + + ConfigConstants.MESOS_MAX_FAILED_TASKS + "' configuration setting. "
          + + "By default its the number of requested tasks.";
          +
          + LOG.error(msg);
          + self().tell(decorateMessage(new StopCluster(ApplicationStatus.FAILED, msg)),
          + ActorRef.noSender());
          +
          + // no need to do anything else
          + return;
          + }
          + }
          +
          + // in case failed containers were among the finished containers, make
          + // sure we re-examine and request new ones
          + triggerCheckWorkers();
          + }
          +
          + // ------------------------------------------------------------------------
          + // Utilities
          + // ------------------------------------------------------------------------
          +
          + private LaunchableMesosWorker createLaunchableMesosWorker(Protos.TaskID taskID) {
          + LaunchableMesosWorker launchable =
          + new LaunchableMesosWorker(taskManagerParameters, taskManagerLaunchContext, taskID);
          + return launchable;
          — End diff –

          That's fine. It looks more concise on one line but I agree that it makes debugging more complicated.
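
          For readers following the thread, the trade-off being discussed can be sketched roughly as below. This is only an illustration: the names ReturnStyleSketch, Launchable, createConcise and createWithLocal are hypothetical stand-ins, not classes or methods from the Flink code base. Both factory methods behave identically; the second keeps a named local variable, which gives a debugger a line to break on and a value to inspect before the method returns.

```java
public class ReturnStyleSketch {

    /** Hypothetical stand-in for a launchable worker; not the Flink class. */
    static final class Launchable {
        final String taskId;

        Launchable(String taskId) {
            this.taskId = taskId;
        }
    }

    /** Concise style: construct and return in a single expression. */
    static Launchable createConcise(String taskId) {
        return new Launchable(taskId);
    }

    /** Debug-friendly style: the named local can be inspected before returning. */
    static Launchable createWithLocal(String taskId) {
        Launchable launchable = new Launchable(taskId);
        return launchable;
    }

    public static void main(String[] args) {
        // Both styles produce an equivalent object.
        System.out.println(createConcise("task-1").taskId);
        System.out.println(createWithLocal("task-1").taskId);
    }
}
```

          Either form is fine functionally; the choice is about readability versus ease of stepping through the code.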

          githubbot ASF GitHub Bot added a comment -

          Github user EronWright commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75587011

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/LaunchableMesosWorker.java —
          @@ -0,0 +1,205 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.runtime.clusterframework;
          +
          +import com.netflix.fenzo.ConstraintEvaluator;
          +import com.netflix.fenzo.TaskAssignmentResult;
          +import com.netflix.fenzo.TaskRequest;
          +import com.netflix.fenzo.VMTaskFitnessCalculator;
          +import org.apache.flink.configuration.Configuration;
          +import org.apache.flink.mesos.cli.FlinkMesosSessionCli;
          +import org.apache.flink.mesos.scheduler.LaunchableTask;
          +import org.apache.mesos.Protos;
          +
          +import java.util.Collections;
          +import java.util.List;
          +import java.util.Map;
          +import java.util.concurrent.atomic.AtomicReference;
          +
          +import static org.apache.flink.mesos.Utils.variable;
          +import static org.apache.flink.mesos.Utils.range;
          +import static org.apache.flink.mesos.Utils.ranges;
          +import static org.apache.flink.mesos.Utils.scalar;
          +
          +/**
          + * Specifies how to launch a Mesos worker.
          + */
          +public class LaunchableMesosWorker implements LaunchableTask {
          +
          + /**
          + * The set of configuration keys to be dynamically configured with a port allocated from Mesos.
          + */
          + private static String[] TM_PORT_KEYS = {
          + "taskmanager.rpc.port",
          + "taskmanager.data.port" };
          +
          + private final MesosTaskManagerParameters params;
          + private final Protos.TaskInfo.Builder template;
          + private final Protos.TaskID taskID;
          + private final Request taskRequest;
          +
          + /**
          + * Construct a launchable Mesos worker.
          + * @param params the TM parameters such as memory, cpu to acquire.
          + * @param template a template for the TaskInfo to be constructed at launch time.
          + * @param taskID the taskID for this worker.
          + */
          + public LaunchableMesosWorker(MesosTaskManagerParameters params, Protos.TaskInfo.Builder template, Protos.TaskID taskID) {
          + this.params = params;
          + this.template = template;
          + this.taskID = taskID;
          + this.taskRequest = new Request();
          + }
          +
          + public Protos.TaskID taskID() {
          + return taskID;
          + }
          +
          + @Override
          + public TaskRequest taskRequest() {
          + return taskRequest;
          + }
          +
          + class Request implements TaskRequest {
          + private final AtomicReference<TaskRequest.AssignedResources> assignedResources = new AtomicReference<>();
          +
          + @Override
          + public String getId() {
          + return taskID.getValue();
          + }
          +
          + @Override
          + public String taskGroupName() {
          + return "";
          + }
          +
          + @Override
          + public double getCPUs() {
          + return params.cpus();
          + }
          +
          + @Override
          + public double getMemory() {
          + return params.containeredParameters().taskManagerTotalMemoryMB();
          + }
          +
          + @Override
          + public double getNetworkMbps() {
          + return 0.0;
          + }
          +
          + @Override
          + public double getDisk() {
          + return 0.0;
          — End diff –

          To be clear, this code implements the Fenzo `TaskRequest` interface, and `getDisk` must return a value.
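As a rough illustration of the point above, and not the Fenzo API itself: a class that implements an interface has to supply a body for every abstract method, so a resource the worker does not request still has to be reported, here as 0.0. `ResourceRequest` and `WorkerRequest` are hypothetical stand-ins for Fenzo's `TaskRequest` and the `Request` class in the diff.

```java
public class DiskRequestSketch {

    /** Hypothetical stand-in for a scheduler's task-request interface. */
    public interface ResourceRequest {
        double getCPUs();
        double getMemory();
        double getDisk();
    }

    /** Hypothetical worker request that asks for CPU and memory only. */
    public static final class WorkerRequest implements ResourceRequest {
        private final double cpus;
        private final double memoryMB;

        public WorkerRequest(double cpus, double memoryMB) {
            this.cpus = cpus;
            this.memoryMB = memoryMB;
        }

        @Override
        public double getCPUs() {
            return cpus;
        }

        @Override
        public double getMemory() {
            return memoryMB;
        }

        // The interface requires a value; 0.0 means no disk is requested.
        @Override
        public double getDisk() {
            return 0.0;
        }
    }

    public static void main(String[] args) {
        ResourceRequest request = new WorkerRequest(1.0, 1024.0);
        System.out.println("cpus=" + request.getCPUs()
            + " mem=" + request.getMemory()
            + " disk=" + request.getDisk());
    }
}
```

Omitting `getDisk` from `WorkerRequest` would be a compile error, which is why the method is present in the diff even though no disk is requested.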

          githubbot ASF GitHub Bot added a comment -

          Github user EronWright commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75395325

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/util/MesosArtifactServer.java —
          @@ -0,0 +1,304 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.util;
          +
          +import io.netty.bootstrap.ServerBootstrap;
          +import io.netty.buffer.Unpooled;
          +import io.netty.channel.Channel;
          +import io.netty.channel.ChannelFuture;
          +import io.netty.channel.ChannelFutureListener;
          +import io.netty.channel.ChannelHandler;
          +import io.netty.channel.ChannelHandlerContext;
          +import io.netty.channel.ChannelInitializer;
          +import io.netty.channel.DefaultFileRegion;
          +import io.netty.channel.SimpleChannelInboundHandler;
          +import io.netty.channel.nio.NioEventLoopGroup;
          +import io.netty.channel.socket.SocketChannel;
          +import io.netty.channel.socket.nio.NioServerSocketChannel;
          +import io.netty.handler.codec.http.DefaultFullHttpResponse;
          +import io.netty.handler.codec.http.DefaultHttpResponse;
          +import io.netty.handler.codec.http.FullHttpResponse;
          +import io.netty.handler.codec.http.HttpHeaders;
          +import io.netty.handler.codec.http.HttpRequest;
          +import io.netty.handler.codec.http.HttpResponse;
          +import io.netty.handler.codec.http.HttpResponseStatus;
          +import io.netty.handler.codec.http.HttpServerCodec;
          +import io.netty.handler.codec.http.LastHttpContent;
          +import io.netty.handler.codec.http.router.Handler;
          +import io.netty.handler.codec.http.router.Routed;
          +import io.netty.handler.codec.http.router.Router;
          +import io.netty.util.CharsetUtil;
          +import org.jets3t.service.utils.Mimetypes;
          +import org.slf4j.Logger;
          +import org.slf4j.LoggerFactory;
          +
          +import java.io.File;
          +import java.io.FileNotFoundException;
          +import java.io.RandomAccessFile;
          +import java.net.InetSocketAddress;
          +import java.net.MalformedURLException;
          +import java.net.URL;
          +
          +import static io.netty.handler.codec.http.HttpHeaders.Names.CACHE_CONTROL;
          +import static io.netty.handler.codec.http.HttpHeaders.Names.CONNECTION;
          +import static io.netty.handler.codec.http.HttpHeaders.Names.CONTENT_TYPE;
          +import static io.netty.handler.codec.http.HttpMethod.GET;
          +import static io.netty.handler.codec.http.HttpMethod.HEAD;
          +import static io.netty.handler.codec.http.HttpResponseStatus.GONE;
          +import static io.netty.handler.codec.http.HttpResponseStatus.INTERNAL_SERVER_ERROR;
          +import static io.netty.handler.codec.http.HttpResponseStatus.METHOD_NOT_ALLOWED;
          +import static io.netty.handler.codec.http.HttpResponseStatus.NOT_FOUND;
          +import static io.netty.handler.codec.http.HttpResponseStatus.OK;
          +import static io.netty.handler.codec.http.HttpVersion.HTTP_1_1;
          +
          +
          +/**
          + * A generic Mesos artifact server, designed specifically for use by the Mesos Fetcher.
          + *
          + * More information:
          + * http://mesos.apache.org/documentation/latest/fetcher/
          + * http://mesos.apache.org/documentation/latest/fetcher-cache-internals/
          + */
          +public class MesosArtifactServer {
          +
          + private static final Logger LOG = LoggerFactory.getLogger(MesosArtifactServer.class);
          +
          + private final Router router;
          +
          + private ServerBootstrap bootstrap;
          +
          + private Channel serverChannel;
          +
          + private URL baseURL;
          — End diff –

          I'll tackle this later in a follow-up task, since I am making changes to the artifact server for the dispatcher's purposes.

          githubbot ASF GitHub Bot added a comment -

          Github user EronWright commented on the issue:

          https://github.com/apache/flink/pull/2315

          Thanks for the review @tillrohrmann @mxm and @StephanEwen. I'm addressing the feedback with a follow-up commit to be completed ASAP.

          githubbot ASF GitHub Bot added a comment -

          Github user EronWright commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75335049

          — Diff: flink-mesos/src/test/scala/org/apache/flink/mesos/scheduler/LaunchCoordinatorTest.scala —
          @@ -0,0 +1,439 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.scheduler
          +
          +import java.util.{Collections, UUID}
          +import java.util.concurrent.atomic.AtomicReference
          +
          +import akka.actor.FSM.StateTimeout
          +import akka.testkit._
          +import com.netflix.fenzo.TaskRequest.{AssignedResources, NamedResourceSetRequest}
          +import com.netflix.fenzo._
          +import com.netflix.fenzo.functions.{Action1, Action2}
          +import com.netflix.fenzo.plugins.VMLeaseObject
          +import org.apache.flink.api.java.tuple.{Tuple2=>FlinkTuple2}
          +import org.apache.flink.configuration.Configuration
          +import org.apache.flink.mesos.scheduler.LaunchCoordinator._
          +import org.apache.flink.mesos.scheduler.messages._
          +import org.apache.flink.runtime.akka.AkkaUtils
          +import org.apache.mesos.Protos.{SlaveID, TaskInfo}
          +import org.apache.mesos.{SchedulerDriver, Protos}
          +import org.junit.runner.RunWith
          +import org.mockito.Mockito.{verify, _}
          +import org.mockito.invocation.InvocationOnMock
          +import org.mockito.stubbing.Answer
          +import org.mockito.{Matchers => MM, Mockito}
          +import org.scalatest.junit.JUnitRunner
          +import org.scalatest.{BeforeAndAfterAll, Matchers, WordSpecLike}
          +
          +import scala.collection.JavaConverters._
          +
          +import org.apache.flink.mesos.Utils.range
          +import org.apache.flink.mesos.Utils.ranges
          +import org.apache.flink.mesos.Utils.scalar
          +
          +@RunWith(classOf[JUnitRunner])
          +class LaunchCoordinatorTest
          + extends TestKitBase
          + with ImplicitSender
          + with WordSpecLike
          + with Matchers
          + with BeforeAndAfterAll {
          +
          + lazy val config = new Configuration()
          + implicit lazy val system = AkkaUtils.createLocalActorSystem(config)
          +
          + override def afterAll(): Unit = {
          + TestKit.shutdownActorSystem(system)
          + }
          +
          + def randomFramework = {
          + Protos.FrameworkID.newBuilder().setValue(UUID.randomUUID.toString).build
          + }
          +
          + def randomTask = {
          + val taskID = Protos.TaskID.newBuilder.setValue(UUID.randomUUID.toString).build
          +
          + def generateTaskRequest = {
          + new TaskRequest() {
          + private[mesos] val assignedResources = new AtomicReference[TaskRequest.AssignedResources]
          + override def getId: String = taskID.getValue
          + override def taskGroupName: String = ""
          + override def getCPUs: Double = 1.0
          + override def getMemory: Double = 1024.0
          + override def getNetworkMbps: Double = 0.0
          + override def getDisk: Double = 0.0
          + override def getPorts: Int = 1
          + override def getCustomNamedResources: java.util.Map[String, NamedResourceSetRequest] =
          + Collections.emptyMap[String, NamedResourceSetRequest]
          + override def getSoftConstraints: java.util.List[_ <: VMTaskFitnessCalculator] = null
          + override def getHardConstraints: java.util.List[_ <: ConstraintEvaluator] = null
          + override def getAssignedResources: AssignedResources = assignedResources.get()
          + override def setAssignedResources(assignedResources: AssignedResources): Unit = {
          + this.assignedResources.set(assignedResources)
          + }
          + }
          + }
          +
          + val task: LaunchableTask = new LaunchableTask() {
          + override def taskRequest: TaskRequest = generateTaskRequest
          + override def launch(slaveId: SlaveID, taskAssignment: TaskAssignmentResult): Protos.TaskInfo = {
          + Protos.TaskInfo.newBuilder
          + .setTaskId(taskID).setName(taskID.getValue)
          + .setCommand(Protos.CommandInfo.newBuilder.setValue("whoami"))
          + .setSlaveId(slaveId)
          + .build()
          + }
          + override def toString = taskRequest.getId
          + }
          +
          + (taskID, task)
          + }
          +
          + def randomSlave = {
          + val slaveID = Protos.SlaveID.newBuilder.setValue(UUID.randomUUID.toString).build
          + val hostname = s"host-${slaveID.getValue}"
          + (slaveID, hostname)
          + }
          +
          + def randomOffer(frameworkID: Protos.FrameworkID, slave: (Protos.SlaveID, String)) = {
          + val offerID = Protos.OfferID.newBuilder().setValue(UUID.randomUUID.toString)
          + Protos.Offer.newBuilder()
          + .setFrameworkId(frameworkID)
          + .setId(offerID)
          + .setSlaveId(slave._1)
          + .setHostname(slave._2)
          + .addResources(scalar("cpus", 0.75))
          + .addResources(scalar("mem", 4096.0))
          + .addResources(scalar("disk", 1024.0))
          + .addResources(ranges("ports", range(9000, 9001)))
          + .build()
          + }
          +
          + def lease(offer: Protos.Offer) = {
          + new VMLeaseObject(offer)
          + }
          +
          + /**
          + * Mock a successful task assignment result matching a task to an offer.
          + */
          + def taskAssignmentResult(lease: VirtualMachineLease, task: TaskRequest): TaskAssignmentResult = {
          + val ports = lease.portRanges().get(0)
          + val r = mock(classOf[TaskAssignmentResult])
          + when(r.getTaskId).thenReturn(task.getId)
          + when(r.getHostname).thenReturn(lease.hostname())
          + when(r.getAssignedPorts).thenReturn(
          + (ports.getBeg to ports.getBeg + task.getPorts).toList.asJava.asInstanceOf[java.util.List[Integer]])
          + when(r.getRequest).thenReturn(task)
          + when(r.isSuccessful).thenReturn(true)
          + when(r.getFitness).thenReturn(1.0)
          + r
          + }
          +
          + /**
          + * Mock a VM assignment result with the given leases and tasks.
          + */
          + def vmAssignmentResult(hostname: String,
          + leasesUsed: Seq[VirtualMachineLease],
          + tasksAssigned: Set[TaskAssignmentResult]): VMAssignmentResult = {
          + new VMAssignmentResult(hostname, leasesUsed.asJava, tasksAssigned.asJava)
          + }
          +
          + /**
          + * Mock a scheduling result with the given successes and failures.
          + */
          + def schedulingResult(successes: Seq[VMAssignmentResult],
          + failures: Seq[TaskAssignmentResult] = Nil,
          + exceptions: Seq[Exception] = Nil,
          + leasesAdded: Int = 0,
          + leasesRejected: Int = 0): SchedulingResult = {
          + val r = mock(classOf[SchedulingResult])
          + when(r.getResultMap).thenReturn(successes.map(r => r.getHostname -> r).toMap.asJava)
          + when(r.getExceptions).thenReturn(exceptions.asJava)
          + val groupedFailures = failures.groupBy(_.getRequest).mapValues(_.asJava)
          + when(r.getFailures).thenReturn(groupedFailures.asJava)
          + when(r.getLeasesAdded).thenReturn(leasesAdded)
          + when(r.getLeasesRejected).thenReturn(leasesRejected)
          + when(r.getRuntime).thenReturn(0)
          + when(r.getNumAllocations).thenThrow(new NotImplementedError())
          + when(r.getTotalVMsCount).thenThrow(new NotImplementedError())
          + when(r.getIdleVMsCount).thenThrow(new NotImplementedError())
          + r
          + }
          +
          +
          + /**
          + * Mock a task scheduler.
          + * The task assigner/unassigner is pre-wired.
          + */
          + def taskScheduler() = {
          + val optimizer = mock(classOf[TaskScheduler])
          + val taskAssigner = mock(classOf[Action2[TaskRequest, String]])
          + when[Action2[TaskRequest, String]](optimizer.getTaskAssigner).thenReturn(taskAssigner)
          + val taskUnassigner = mock(classOf[Action2[String, String]])
          + when[Action2[String, String]](optimizer.getTaskUnAssigner).thenReturn(taskUnassigner)
          + optimizer
          + }
          +
          + /**
          + * Create a task scheduler builder.
          + */
          + def taskSchedulerBuilder(optimizer: TaskScheduler) = new TaskSchedulerBuilder {
          + var leaseRejectAction: Action1[VirtualMachineLease] = null
          + override def withLeaseRejectAction(action: Action1[VirtualMachineLease]): TaskSchedulerBuilder = {
          + leaseRejectAction = action
          + this
          + }
          + override def build(): TaskScheduler = optimizer
          + }
          +
          + /**
          + * Process a call to scheduleOnce with the given function.
          + */
          + def scheduleOnce(f: (Seq[TaskRequest],Seq[VirtualMachineLease]) => SchedulingResult) = {
          + new Answer[SchedulingResult] {
          + override def answer(invocationOnMock: InvocationOnMock): SchedulingResult = {
          + val args = invocationOnMock.getArguments
          + val requests = args(0).asInstanceOf[java.util.List[TaskRequest]]
          + val newLeases = args(1).asInstanceOf[java.util.List[VirtualMachineLease]]
          + f(requests.asScala, newLeases.asScala)
          + }
          + }
          + }
          +
          + /**
          + * The context fixture.
          + */
          + class Context {
          + val optimizer = taskScheduler()
          + val optimizerBuilder = taskSchedulerBuilder(optimizer)
          + val schedulerDriver = mock(classOf[SchedulerDriver])
          + val trace = Mockito.inOrder(schedulerDriver)
          + val fsm = TestFSMRef(new LaunchCoordinator(testActor, config, schedulerDriver, optimizerBuilder))
          +
          + val framework = randomFramework
          + val task1 = randomTask
          + val task2 = randomTask
          + val task3 = randomTask
          +
          + val slave1 = {
          + val slave = randomSlave
          + (slave._1, slave._2, randomOffer(framework, slave), randomOffer(framework, slave), randomOffer(framework, slave))
          + }
          +
          + val slave2 = {
          + val slave = randomSlave
          + (slave._1, slave._2, randomOffer(framework, slave), randomOffer(framework, slave), randomOffer(framework, slave))
          + }
          + }
          +
          + def inState = afterWord("in state")
          + def handle = afterWord("handle")
          +
          + def handlesAssignments(state: TaskState) = {
          + "Unassign" which {
          + s"stays in $state with updated optimizer state" in new Context {
          + optimizer.getTaskAssigner.call(task1._2.taskRequest, slave1._2)
          + fsm.setState(state)
          + fsm ! Unassign(task1._1, slave1._2)
          + verify(optimizer.getTaskUnAssigner).call(task1._1.getValue, slave1._2)
          + fsm.stateName should be (state)
          + }
          + }
          + "Assign" which {
          + s"stays in $state with updated optimizer state" in new Context

          { + fsm.setState(state) + fsm ! Assign(Seq(new FlinkTuple2(task1._2.taskRequest, slave1._2)).asJava) + verify(optimizer.getTaskAssigner).call(MM.any(), MM.any()) + fsm.stateName should be (state) + }

          + }
          + }
          +
          + "The LaunchCoordinator" when inState {
          +
          + "Suspended" should handle {
          — End diff –

          Correct, the text alone doesn't set the initial state. The test bodies call `fsm.setState` to establish the initial state (along with corresponding state data).

          I found the wordspec DSL to be a good fit for Akka FSM testing, because it can be made to closely match the FSM concept of being in some state and handling specific events. The wordspec test output is delightful, with a nice tree structure in the IDE and with nice sentences in the console.
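
The fit between the WordSpec DSL and an FSM's "in state X, handle event Y" structure can be sketched on a toy example. This is a minimal sketch, not code from the pull request: `Light`, `LightState`, `Toggle`, and `LightSpec` are hypothetical names, and it assumes Akka classic with akka-testkit plus ScalaTest 3.0.x on the classpath (newer ScalaTest releases move `WordSpecLike` and `Matchers` to other packages).

```scala
import akka.actor.{Actor, ActorSystem, FSM}
import akka.testkit.{TestFSMRef, TestKit}
import org.scalatest.{BeforeAndAfterAll, Matchers, WordSpecLike}

// Hypothetical two-state FSM, used only to illustrate the test layout.
sealed trait LightState
case object Off extends LightState
case object On extends LightState
case object Toggle

// The state data counts how many toggles have been handled.
class Light extends Actor with FSM[LightState, Int] {
  startWith(Off, 0)
  when(Off) { case Event(Toggle, n) => goto(On) using (n + 1) }
  when(On) { case Event(Toggle, n) => goto(Off) using (n + 1) }
  initialize()
}

class LightSpec extends TestKit(ActorSystem("LightSpec"))
  with WordSpecLike with Matchers with BeforeAndAfterAll {

  override def afterAll(): Unit = TestKit.shutdownActorSystem(system)

  // After-words turn the nesting into readable sentences in the report.
  def inState = afterWord("in state")
  def handle = afterWord("handle")

  "The Light" when inState {
    "On" should handle {
      "Toggle" which {
        "moves to Off and counts the toggle" in {
          val fsm = TestFSMRef(new Light)
          // The sentence text does not set the state; the test body does.
          fsm.setState(On, 3)
          fsm ! Toggle
          fsm.stateName should be (Off)
          fsm.stateData should be (4)
        }
      }
    }
  }
}
```

`TestFSMRef` runs the actor on the calling thread, so the assertions can follow the send directly without waiting.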

          githubbot ASF GitHub Bot added a comment -

          Github user EronWright commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75333536

          — Diff: flink-mesos/src/main/scala/org/apache/flink/mesos/scheduler/Tasks.scala —
          @@ -0,0 +1,114 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.scheduler
          +
          +import akka.actor.{Actor, ActorRef, Props}
          +import org.apache.flink.configuration.Configuration
          +import org.apache.flink.mesos.scheduler.ReconciliationCoordinator.Reconcile
          +import org.apache.flink.mesos.scheduler.TaskMonitor.{TaskGoalState, TaskGoalStateUpdated, TaskTerminated}
          +import org.apache.flink.mesos.scheduler.Tasks._
          +import org.apache.flink.mesos.scheduler.messages._
          +import org.apache.mesos.{SchedulerDriver, Protos}
          +
          +import scala.collection.mutable.{Map => MutableMap}
          +
          +/**
          + * Aggregate of monitored tasks.
          + *
          + * Routes messages between the scheduler and individual task monitor actors.
          + */
          +class Tasks[M <: TaskMonitor](
          + flinkConfig: Configuration,
          + schedulerDriver: SchedulerDriver,
          + taskMonitorClass: Class[M]) extends Actor {
          +
          + /**
          + * A map of task monitors by task ID.
          + */
          + private val taskMap: MutableMap[Protos.TaskID,ActorRef] = MutableMap()
          +
          + /**
          + * Cache of current connection state.
          + */
          + private var registered: Option[Any] = None
          +
          + override def preStart(): Unit = {
          + // TODO subscribe to context.system.deadLetters for messages to nonexistent tasks
          + }
          +
          + override def receive: Receive = {
          +
          + case msg: Disconnected =>
          + registered = None
          + context.actorSelection("*").tell(msg, self)
          +
          + case msg : Connected =>
          + registered = Some(msg)
          + context.actorSelection("*").tell(msg, self)
          +
          + case msg: TaskGoalStateUpdated =>
          + val taskID = msg.state.taskID
          +
          + // ensure task monitor exists
          + if(!taskMap.contains(taskID)) {
          + val actorRef = createTask(msg.state)
          + registered.foreach(actorRef ! _)
          + }
          +
          + taskMap(taskID) ! msg
          +
          + case msg: StatusUpdate =>
          + taskMap(msg.status().getTaskId) ! msg
          +
          + case msg: Reconcile =>
          + context.parent.forward(msg)
          — End diff –

          Actually this code is generic; at best we could say it is the scheduler (which is actually the RM or the dispatcher).

          I think the argument is stronger for this reference to be explicit than for the TaskMonitor case, because `Tasks` and `TaskMonitor` are coupled by design (one is the aggregate for the other). I'll fix this reference if I have time.

          githubbot ASF GitHub Bot added a comment -

          Github user EronWright commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75332580

          — Diff: flink-mesos/src/main/scala/org/apache/flink/mesos/scheduler/TaskMonitor.scala —
          @@ -0,0 +1,258 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.scheduler
          +
          +import grizzled.slf4j.Logger
          +
          +import akka.actor.{Actor, FSM, Props}
          +import org.apache.flink.configuration.Configuration
          +import org.apache.flink.mesos.scheduler.ReconciliationCoordinator.Reconcile
          +import org.apache.flink.mesos.scheduler.TaskMonitor._
          +import org.apache.flink.mesos.scheduler.messages.{Connected, Disconnected, StatusUpdate}
          +import org.apache.mesos.Protos.TaskState._
          +import org.apache.mesos.{SchedulerDriver, Protos}
          +
          +import scala.PartialFunction.empty
          +import scala.concurrent.duration._
          +
          +/**
          + * Monitors a Mesos task throughout its lifecycle.
          + *
          + * Models a task with a state machine reflecting the perceived state of the task in Mesos. The state
          + * is primarily updated when task status information arrives from Mesos.
          + *
          + * The associated state data primarily tracks the task's goal (intended) state, as persisted by the scheduler.
          + * Keep in mind that goal state is persisted before actions are taken. The goal state strictly transitions
          + * thru New->Launched->Released.
          + *
          + * Unlike most exchanges with Mesos, task status is delivered at-least-once, so status handling should be idempotent.
          + */
          +class TaskMonitor(
          + flinkConfig: Configuration,
          + schedulerDriver: SchedulerDriver,
          + goalState: TaskGoalState) extends Actor with FSM[TaskMonitorState,StateData] {
          +
          + val LOG = Logger(getClass)
          +
          + startWith(Suspended, StateData(goalState))
          +
          + // ------------------------------------------------------------------------
          + // Suspended State
          + // ------------------------------------------------------------------------
          +
          + when(Suspended) {
          + case Event(update: TaskGoalStateUpdated, _) =>
          + stay() using StateData(update.state)
          + case Event(msg: StatusUpdate, _) =>
          + stay()
          + case Event(msg: Connected, StateData(goal: New)) =>
          + goto(New)
          + case Event(msg: Connected, StateData(goal: Launched)) =>
          + goto(Reconciling)
          + case Event(msg: Connected, StateData(goal: Released)) =>
          + goto(Killing)
          + }
          +
          + // ------------------------------------------------------------------------
          + // New State
          + // ------------------------------------------------------------------------
          +
          + when(New) {
          + case Event(TaskGoalStateUpdated(goal: Launched), _) =>
          + goto(Staging) using StateData(goal)
          + }
          +
          + // ------------------------------------------------------------------------
          + // Reconciliation State
          + // ------------------------------------------------------------------------
          +
          + onTransition {
          + case _ -> Reconciling =>
          + nextStateData.goal match {
          + case goal: Launched =>
          + val taskStatus = Protos.TaskStatus.newBuilder()
          + .setTaskId(goal.taskID).setSlaveId(goal.slaveID).setState(TASK_STAGING).build()
          + context.parent ! Reconcile(Seq(taskStatus))
          — End diff –

          I agree that the alternative to using the implicit parent reference is to always use an explicit reference. It can be unit tested either way. I actually considered the latter but couldn't think of a nice name for the explicit reference.
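
The two options discussed here, and the claim that either can be unit tested, can be sketched side by side. This is a minimal sketch under stated assumptions: `ImplicitParentMonitor`, `ExplicitRefMonitor`, the `reconciler` parameter name, and the simplified `Connected` and `Reconcile` messages are all hypothetical stand-ins, and it assumes Akka classic 2.4 or later with akka-testkit (for `TestProbe.childActorOf`).

```scala
import akka.actor.{Actor, ActorRef, ActorSystem, Props}
import akka.testkit.{TestKit, TestProbe}

// Hypothetical messages standing in for the real Connected / Reconcile types.
case object Connected
final case class Reconcile(taskIds: Seq[String])

// Variant 1: the destination is whichever actor created this one.
class ImplicitParentMonitor extends Actor {
  def receive: Receive = {
    case Connected => context.parent ! Reconcile(Seq("task-1"))
  }
}

// Variant 2: the destination is a named constructor argument.
class ExplicitRefMonitor(reconciler: ActorRef) extends Actor {
  def receive: Receive = {
    case Connected => reconciler ! Reconcile(Seq("task-1"))
  }
}

object ParentReferenceDemo {
  /** Exercises both variants; expectMsg throws if a probe sees nothing. */
  def run(): Boolean = {
    implicit val system: ActorSystem = ActorSystem("parent-reference-demo")
    try {
      // Implicit parent: the probe itself becomes the parent of the actor.
      val parentProbe = TestProbe()
      val child = parentProbe.childActorOf(Props(new ImplicitParentMonitor))
      child ! Connected
      parentProbe.expectMsg(Reconcile(Seq("task-1")))

      // Explicit reference: the probe is passed in like any other dependency.
      val reconcilerProbe = TestProbe()
      val monitor = system.actorOf(Props(new ExplicitRefMonitor(reconcilerProbe.ref)))
      monitor ! Connected
      reconcilerProbe.expectMsg(Reconcile(Seq("task-1")))
      true
    } finally {
      TestKit.shutdownActorSystem(system)
    }
  }
}
```

The explicit variant makes the dependency visible in the constructor at the cost of having to name it; the implicit variant keeps the constructor small but ties the actor to its position in the hierarchy.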

          githubbot ASF GitHub Bot added a comment -

          Github user EronWright commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75331224

          — Diff: flink-mesos/src/main/scala/org/apache/flink/mesos/scheduler/ConnectionMonitor.scala —
          @@ -0,0 +1,126 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.scheduler
          +
          +import akka.actor.{Actor, FSM, Props}
          +import grizzled.slf4j.Logger
          +import org.apache.flink.configuration.Configuration
          +import org.apache.flink.mesos.scheduler.ConnectionMonitor._
          +import org.apache.flink.mesos.scheduler.messages._
          +
          +import scala.concurrent.duration._
          +
          +/**
          + * Actively monitors the Mesos connection.
          + */
          +class ConnectionMonitor() extends Actor with FSM[FsmState, Unit] {
          +
          + val LOG = Logger(getClass)
          +
          + startWith(StoppedState, None)
          +
          + when(StoppedState) {
          + case Event(msg: Start, _) =>
          + LOG.info(s"Connecting to Mesos...")
          + goto(ConnectingState)
          + }
          +
          + when(ConnectingState, stateTimeout = CONNECT_RETRY_RATE) {
          + case Event(msg: Stop, _) =>
          + goto(StoppedState)
          +
          + case Event(msg: Registered, _) =>
          + LOG.info(s"Connected to Mesos as framework ID ${msg.frameworkId.getValue}.")
          + LOG.debug(s" Master Info: ${msg.masterInfo}")
          + goto(ConnectedState)
          +
          + case Event(msg: ReRegistered, _) =>
          + LOG.info("Reconnected to a new Mesos master.")
          + LOG.debug(s" Master Info: ${msg.masterInfo}")
          + goto(ConnectedState)
          +
          + case Event(StateTimeout, _) =>
          + LOG.warn("Unable to connect to Mesos; still trying...")
          + stay()
          + }
          +
          + when(ConnectedState) {
          + case Event(msg: Stop, _) =>
          + goto(StoppedState)
          +
          + case Event(msg: Disconnected, _) =>
          + LOG.warn("Disconnected from the Mesos master. Reconnecting...")
          + goto(ConnectingState)
          + }
          +
          — End diff –

          My rationale has been to let the default handling occur for unhandled events. When I do use the `whenUnhandled` block, it is for common code. Do tell if you see an event I should be handling, or if I have otherwise misunderstood your comment.
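
The dispatch order this rationale relies on can be modelled without Akka. The following is a dependency-free sketch, not Akka's implementation: as I read the Akka classic FSM documentation, a state's own handler is tried first, then any `whenUnhandled` handler, and otherwise the event is logged and the FSM stays in its current state. `FallbackModel`, `step`, and the simplified state and event objects are hypothetical names chosen to echo the diff above.

```scala
// Hypothetical states and events, loosely named after the ConnectionMonitor diff.
sealed trait FsmState
case object StoppedState extends FsmState
case object ConnectingState extends FsmState
case object ConnectedState extends FsmState

sealed trait FsmEvent
case object Start extends FsmEvent
case object Stop extends FsmEvent
case object Registered extends FsmEvent
case object Disconnected extends FsmEvent

object FallbackModel {
  type Handler = PartialFunction[FsmEvent, FsmState]

  // State-specific handlers: the analogue of the `when(...)` blocks.
  val whenStopped: Handler = {
    case Start => ConnectingState
  }
  val whenConnecting: Handler = {
    case Stop => StoppedState
    case Registered => ConnectedState
  }
  val whenConnected: Handler = {
    case Stop => StoppedState
    case Disconnected => ConnectingState
  }

  def handlerFor(state: FsmState): Handler = state match {
    case StoppedState => whenStopped
    case ConnectingState => whenConnecting
    case ConnectedState => whenConnected
  }

  /**
   * Tries the state's own handler, then the shared fallback (the analogue
   * of `whenUnhandled`), and finally keeps the current state unchanged.
   */
  def step(state: FsmState, event: FsmEvent, fallback: Handler = PartialFunction.empty): FsmState =
    handlerFor(state).orElse(fallback).applyOrElse(event, (_: FsmEvent) => state)
}
```

With no fallback supplied, `FallbackModel.step(StoppedState, Registered)` leaves the machine in `StoppedState`, which mirrors the "let the default handling occur" choice; a fallback only matters for events that no state-specific handler claims.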

          githubbot ASF GitHub Bot added a comment -

          Github user EronWright commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75325649

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/MesosFlinkResourceManager.java —
          @@ -0,0 +1,755 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.runtime.clusterframework;
          +
          +import akka.actor.ActorRef;
          +import akka.actor.Props;
          +import com.netflix.fenzo.TaskRequest;
          +import com.netflix.fenzo.TaskScheduler;
          +import com.netflix.fenzo.VirtualMachineLease;
          +import com.netflix.fenzo.functions.Action1;
          +import org.apache.flink.api.java.tuple.Tuple2;
          +import org.apache.flink.configuration.ConfigConstants;
          +import org.apache.flink.configuration.Configuration;
          +import org.apache.flink.mesos.runtime.clusterframework.store.MesosWorkerStore;
          +import org.apache.flink.mesos.scheduler.ConnectionMonitor;
          +import org.apache.flink.mesos.scheduler.LaunchableTask;
          +import org.apache.flink.mesos.scheduler.LaunchCoordinator;
          +import org.apache.flink.mesos.scheduler.ReconciliationCoordinator;
          +import org.apache.flink.mesos.scheduler.SchedulerProxy;
          +import org.apache.flink.mesos.scheduler.TaskMonitor;
          +import org.apache.flink.mesos.scheduler.TaskSchedulerBuilder;
          +import org.apache.flink.mesos.scheduler.Tasks;
          +import org.apache.flink.mesos.scheduler.messages.AcceptOffers;
          +import org.apache.flink.mesos.scheduler.messages.Disconnected;
          +import org.apache.flink.mesos.scheduler.messages.Error;
          +import org.apache.flink.mesos.scheduler.messages.OfferRescinded;
          +import org.apache.flink.mesos.scheduler.messages.ReRegistered;
          +import org.apache.flink.mesos.scheduler.messages.Registered;
          +import org.apache.flink.mesos.scheduler.messages.ResourceOffers;
          +import org.apache.flink.mesos.scheduler.messages.StatusUpdate;
          +import org.apache.flink.mesos.util.MesosConfiguration;
          +import org.apache.flink.runtime.clusterframework.ApplicationStatus;
          +import org.apache.flink.runtime.clusterframework.FlinkResourceManager;
          +import org.apache.flink.runtime.clusterframework.messages.FatalErrorOccurred;
          +import org.apache.flink.runtime.clusterframework.messages.StopCluster;
          +import org.apache.flink.runtime.clusterframework.types.ResourceID;
          +import org.apache.flink.runtime.leaderretrieval.LeaderRetrievalService;
          +import org.apache.mesos.Protos;
          +import org.apache.mesos.Protos.FrameworkInfo;
          +import org.apache.mesos.SchedulerDriver;
          +import org.slf4j.Logger;
          +import scala.Option;
          +
          +import java.util.ArrayList;
          +import java.util.Collection;
          +import java.util.HashMap;
          +import java.util.List;
          +import java.util.Map;
          +
          +import static java.util.Objects.requireNonNull;
          +
          +/**
          + * Flink Resource Manager for Apache Mesos.
          + */
          +public class MesosFlinkResourceManager extends FlinkResourceManager<RegisteredMesosWorkerNode> {
          +
          + /** The Mesos configuration (master and framework info) */
          + private final MesosConfiguration mesosConfig;
          +
          + /** The TaskManager container parameters (like container memory size) */
          + private final MesosTaskManagerParameters taskManagerParameters;
          +
          + /** Context information used to start a TaskManager Java process */
          + private final Protos.TaskInfo.Builder taskManagerLaunchContext;
          +
          + /** Number of failed Mesos tasks before stopping the application. -1 means infinite. */
          + private final int maxFailedTasks;
          +
          + /** Callback handler for the asynchronous Mesos scheduler */
          + private SchedulerProxy schedulerCallbackHandler;
          +
          + /** Mesos scheduler driver */
          + private SchedulerDriver schedulerDriver;
          +
          + private ActorRef connectionMonitor;
          +
          + private ActorRef taskRouter;
          +
          + private ActorRef launchCoordinator;
          +
          + private ActorRef reconciliationCoordinator;
          +
          + private MesosWorkerStore workerStore;
          +
          + final Map<ResourceID, MesosWorkerStore.Worker> workersInNew;
          + final Map<ResourceID, MesosWorkerStore.Worker> workersInLaunch;
          + final Map<ResourceID, MesosWorkerStore.Worker> workersBeingReturned;
          +
          + /** The number of failed tasks since the master became active */
          + private int failedTasksSoFar;
          +
          + public MesosFlinkResourceManager(
          + Configuration flinkConfig,
          + MesosConfiguration mesosConfig,
          + MesosWorkerStore workerStore,
          + LeaderRetrievalService leaderRetrievalService,
          + MesosTaskManagerParameters taskManagerParameters,
          + Protos.TaskInfo.Builder taskManagerLaunchContext,
          + int maxFailedTasks,
          + int numInitialTaskManagers) {
          +
          + super(numInitialTaskManagers, flinkConfig, leaderRetrievalService);
          +
          + this.mesosConfig = requireNonNull(mesosConfig);
          +
          + this.workerStore = requireNonNull(workerStore);
          +
          + this.taskManagerParameters = requireNonNull(taskManagerParameters);
          + this.taskManagerLaunchContext = requireNonNull(taskManagerLaunchContext);
          + this.maxFailedTasks = maxFailedTasks;
          +
          + this.workersInNew = new HashMap<>();
          + this.workersInLaunch = new HashMap<>();
          + this.workersBeingReturned = new HashMap<>();
          + }
          +
          + // ------------------------------------------------------------------------
          + // Mesos-specific behavior
          + // ------------------------------------------------------------------------
          +
          + @Override
          + protected void initialize() throws Exception {
          + LOG.info("Initializing Mesos resource master");
          +
          + workerStore.start();
          +
          + // create the scheduler driver to communicate with Mesos
          + schedulerCallbackHandler = new SchedulerProxy(self());
          +
          + // register with Mesos
          + FrameworkInfo.Builder frameworkInfo = mesosConfig.frameworkInfo()
          + .clone()
          + .setCheckpoint(true);
          +
          + Option<Protos.FrameworkID> frameworkID = workerStore.getFrameworkID();
          + if(frameworkID.isEmpty()) {
          + LOG.info("Registering as new framework.");
          + }
          + else {
          + LOG.info("Recovery scenario: re-registering using framework ID {}.", frameworkID.get().getValue());
          + frameworkInfo.setId(frameworkID.get());
          + }
          +
          + MesosConfiguration initializedMesosConfig = mesosConfig.withFrameworkInfo(frameworkInfo);
          + MesosConfiguration.logMesosConfig(LOG, initializedMesosConfig);
          + schedulerDriver = initializedMesosConfig.createDriver(schedulerCallbackHandler, false);
          +
          + // create supporting actors
          + connectionMonitor = createConnectionMonitor();
          + launchCoordinator = createLaunchCoordinator();
          + reconciliationCoordinator = createReconciliationCoordinator();
          + taskRouter = createTaskRouter();
          +
          + recoverWorkers();
          +
          + connectionMonitor.tell(new ConnectionMonitor.Start(), self());
          + schedulerDriver.start();
          + }
          +
          + protected ActorRef createConnectionMonitor() {
          + return context().actorOf(
          + ConnectionMonitor.createActorProps(ConnectionMonitor.class, config),
          + "connectionMonitor");
          + }
          +
          + protected ActorRef createTaskRouter() {
          + return context().actorOf(
          + Tasks.createActorProps(Tasks.class, config, schedulerDriver, TaskMonitor.class),
          + "tasks");
          + }
          +
          + protected ActorRef createLaunchCoordinator() {
          + return context().actorOf(
          + LaunchCoordinator.createActorProps(LaunchCoordinator.class, self(), config, schedulerDriver, createOptimizer()),
          + "launchCoordinator");
          + }
          +
          + protected ActorRef createReconciliationCoordinator() {
          + return context().actorOf(
          + ReconciliationCoordinator.createActorProps(ReconciliationCoordinator.class, config, schedulerDriver),
          + "reconciliationCoordinator");
          + }
          +
          + @Override
          + public void postStop() {
          + LOG.info("Stopping Mesos resource master");
          + super.postStop();
          + }
          +
          + // ------------------------------------------------------------------------
          + // Actor messages
          + // ------------------------------------------------------------------------
          +
          + @Override
          + protected void handleMessage(Object message) {
          +
          + // check for Mesos-specific actor messages first
          +
          + // --- messages about Mesos connection
          + if (message instanceof Registered) {
          + registered((Registered) message);
          + } else if (message instanceof ReRegistered) {
          + reregistered((ReRegistered) message);
          + } else if (message instanceof Disconnected) {
          + disconnected((Disconnected) message);
          + } else if (message instanceof Error) {
          + error(((Error) message).message());
          +
          + // --- messages about offers
          + } else if (message instanceof ResourceOffers || message instanceof OfferRescinded) {
          + launchCoordinator.tell(message, self());
          + } else if (message instanceof AcceptOffers) {
          + acceptOffers((AcceptOffers) message);
          +
          + // --- messages about tasks
          + } else if (message instanceof StatusUpdate) {
          + taskStatusUpdated((StatusUpdate) message);
          + } else if (message instanceof ReconciliationCoordinator.Reconcile) {
          + // a reconciliation request from a task
          + reconciliationCoordinator.tell(message, self());
          + } else if (message instanceof TaskMonitor.TaskTerminated) {
          + // a termination message from a task
          + TaskMonitor.TaskTerminated msg = (TaskMonitor.TaskTerminated) message;
          + taskTerminated(msg.taskID(), msg.status());
          +
          + } else {
          + // message handled by the generic resource master code
          + super.handleMessage(message);
          + }
          + }
          +
          + /**
          + * Called to shut down the cluster (not a failover situation).
          + *
          + * @param finalStatus The application status to report.
          + * @param optionalDiagnostics An optional diagnostics message.
          + */
          + @Override
          + protected void shutdownApplication(ApplicationStatus finalStatus, String optionalDiagnostics) {
          +
          + LOG.info("Shutting down and unregistering as a Mesos framework.");
          + try {
          + // unregister the framework, which implicitly removes all tasks.
          + schedulerDriver.stop(false);
          + }
          + catch(Exception ex) {
          + LOG.warn("unable to unregister the framework", ex);
          + }
          +
          + try {
          + workerStore.cleanup();
          + }
          + catch(Exception ex) {
          + LOG.warn("unable to cleanup the ZooKeeper state", ex);
          + }
          +
          + context().stop(self());
          + }
          +
          + @Override
          + protected void fatalError(String message, Throwable error) {
          + // we do not unregister, but cause a hard fail of this process, to have it
          + // restarted by the dispatcher
          + LOG.error("FATAL ERROR IN MESOS APPLICATION MASTER: " + message, error);
          + LOG.error("Shutting down process");
          +
          + // kill this process, this will make an external supervisor (the dispatcher) restart the process
          + System.exit(EXIT_CODE_FATAL_ERROR);
          + }
          +
          + // ------------------------------------------------------------------------
          + // Worker Management
          + // ------------------------------------------------------------------------
          +
          + /**
          + * Recover framework/worker information persisted by a prior incarnation of the RM.
          + */
          + private void recoverWorkers() throws Exception {
          + // if this application master starts as part of an ApplicationMaster/JobManager recovery,
          + // then some worker tasks are most likely still alive and we can re-obtain them
          + final List<MesosWorkerStore.Worker> tasksFromPreviousAttempts = workerStore.recoverWorkers();
          +
          + if (!tasksFromPreviousAttempts.isEmpty()) {
          + LOG.info("Retrieved {} TaskManagers from previous attempt", tasksFromPreviousAttempts.size());
          +
          + List<Tuple2<TaskRequest,String>> toAssign = new ArrayList<>(tasksFromPreviousAttempts.size());
          + List<LaunchableTask> toLaunch = new ArrayList<>(tasksFromPreviousAttempts.size());
          +
          + for (final MesosWorkerStore.Worker worker : tasksFromPreviousAttempts) {
          + LaunchableMesosWorker launchable = createLaunchableMesosWorker(worker.taskID());
          +
          + switch(worker.state()) {
          + case New:
          + workersInNew.put(extractResourceID(worker.taskID()), worker);
          + toLaunch.add(launchable);
          + break;
          + case Launched:
          + workersInLaunch.put(extractResourceID(worker.taskID()), worker);
          + toAssign.add(new Tuple2<>(launchable.taskRequest(), worker.hostname().get()));
          + break;
          + case Released:
          + workersBeingReturned.put(extractResourceID(worker.taskID()), worker);
          + break;
          + }
          + taskRouter.tell(new TaskMonitor.TaskGoalStateUpdated(extractGoalState(worker)), self());
          + }
          +
          + // tell the launch coordinator about prior assignments
          + if(toAssign.size() >= 1) {
          + launchCoordinator.tell(new LaunchCoordinator.Assign(toAssign), self());
          + }
          + // tell the launch coordinator to launch any new tasks
          + if(toLaunch.size() >= 1) {
          + launchCoordinator.tell(new LaunchCoordinator.Launch(toLaunch), self());
          + }
          + }
          + }
          +
          + /**
          + * Plan for some additional workers to be launched.
          + *
          + * @param numWorkers The number of workers to allocate.
          + */
          + @Override
          + protected void requestNewWorkers(int numWorkers) {
          +
          + try {
          + List<TaskMonitor.TaskGoalStateUpdated> toMonitor = new ArrayList<>(numWorkers);
          + List<LaunchableTask> toLaunch = new ArrayList<>(numWorkers);
          +
          + // generate new workers into persistent state and launch associated actors
          + for (int i = 0; i < numWorkers; i++) {
          + MesosWorkerStore.Worker worker = MesosWorkerStore.Worker.newTask(workerStore.newTaskID());
          + workerStore.putWorker(worker);
          + workersInNew.put(extractResourceID(worker.taskID()), worker);
          +
          + LaunchableMesosWorker launchable = createLaunchableMesosWorker(worker.taskID());
          +
          + LOG.info("Scheduling Mesos task {} with ({} MB, {} cpus).",
          + launchable.taskID().getValue(), launchable.taskRequest().getMemory(), launchable.taskRequest().getCPUs());
          +
          + toMonitor.add(new TaskMonitor.TaskGoalStateUpdated(extractGoalState(worker)));
          + toLaunch.add(launchable);
          + }
          +
          + // tell the task router about the new plans
          + for (TaskMonitor.TaskGoalStateUpdated update : toMonitor) {
          + taskRouter.tell(update, self());
          + }
          +
          + // tell the launch coordinator to launch the new tasks
          + if(toLaunch.size() >= 1) {
          + launchCoordinator.tell(new LaunchCoordinator.Launch(toLaunch), self());
          + }
          + }
          + catch(Exception ex) {
          + fatalError("unable to request new workers", ex);
          + }
          + }
          +
          + /**
          + * Accept offers as advised by the launch coordinator.
          + *
          + * Acceptance is routed through the RM to update the persistent state before
          + * forwarding the message to Mesos.
          + */
          + private void acceptOffers(AcceptOffers msg) {
          +
          + try {
          + List<TaskMonitor.TaskGoalStateUpdated> toMonitor = new ArrayList<>(msg.operations().size());
          +
          + // transition the persistent state of some tasks to Launched
          + for (Protos.Offer.Operation op : msg.operations()) {
          + if (op.getType() != Protos.Offer.Operation.Type.LAUNCH) {
          + continue;
          + }
          + for (Protos.TaskInfo info : op.getLaunch().getTaskInfosList()) {
          + MesosWorkerStore.Worker worker = workersInNew.remove(extractResourceID(info.getTaskId()));
          + assert (worker != null);
          +
          + worker = worker.launchTask(info.getSlaveId(), msg.hostname());
          + workerStore.putWorker(worker);
          + workersInLaunch.put(extractResourceID(worker.taskID()), worker);
          +
          + LOG.info("Launching Mesos task {} on host {}.",
          + worker.taskID().getValue(), worker.hostname().get());
          +
          + toMonitor.add(new TaskMonitor.TaskGoalStateUpdated(extractGoalState(worker)));
          + }
          + }
          +
          + // tell the task router about the new plans
          + for (TaskMonitor.TaskGoalStateUpdated update : toMonitor) {
          + taskRouter.tell(update, self());
          + }
          +
          + // send the acceptance message to Mesos
          + schedulerDriver.acceptOffers(msg.offerIds(), msg.operations(), msg.filters());
          + }
          + catch(Exception ex) {
          + fatalError("unable to accept offers", ex);
          + }
          + }
          +
          + /**
          + * Handle a task status change.
          + */
          + private void taskStatusUpdated(StatusUpdate message) {
          + taskRouter.tell(message, self());
          + reconciliationCoordinator.tell(message, self());
          + schedulerDriver.acknowledgeStatusUpdate(message.status());
          + }
          +
          + /**
          + * Accept the given started worker into the internal state.
          + *
          + * @param resourceID The worker resource id
          + * @return A registered worker node record.
          + */
          + @Override
          + protected RegisteredMesosWorkerNode workerStarted(ResourceID resourceID) {
          + MesosWorkerStore.Worker inLaunch = workersInLaunch.remove(resourceID);
          + if (inLaunch == null) {
          + // Worker was not in state "being launched", this can indicate that the TaskManager
          + // in this worker was already registered or that the container was not started
          + // by this resource manager. Simply ignore this resourceID.
          + return null;
          + }
          + return new RegisteredMesosWorkerNode(inLaunch);
          + }
          +
          + /**
          + * Accept the given registered workers into the internal state.
          + *
          + * @param toConsolidate The worker IDs known previously to the JobManager.
          + * @return A collection of registered worker node records.
          + */
          + @Override
          + protected Collection<RegisteredMesosWorkerNode> reacceptRegisteredWorkers(Collection<ResourceID> toConsolidate) {
          +
          + // we check for each task manager if we recognize its Mesos task ID
          + List<RegisteredMesosWorkerNode> accepted = new ArrayList<>(toConsolidate.size());
          + for (ResourceID resourceID : toConsolidate) {
          + MesosWorkerStore.Worker worker = workersInLaunch.remove(resourceID);
          + if (worker != null) {
          + LOG.info("Mesos worker consolidation recognizes TaskManager {}.", resourceID);
          + accepted.add(new RegisteredMesosWorkerNode(worker));
          + }
          + else {
          + if(isStarted(resourceID)) {
          + LOG.info("TaskManager {} has already been registered at the resource manager.", resourceID);
          + }
          + else {
          + LOG.info("Mesos worker consolidation does not recognize TaskManager {}.", resourceID);
          + }
          + }
          + }
          + return accepted;
          + }
          +
          + /**
          + * Release the given pending worker.
          + */
          + @Override
          + protected void releasePendingWorker(ResourceID id) {
          + MesosWorkerStore.Worker worker = workersInLaunch.remove(id);
          + if (worker != null) {
          + releaseWorker(worker);
          + } else {
          + LOG.error("Cannot find worker {} to release. Ignoring request.", id);
          + }
          + }
          +
          + /**
          + * Release the given started worker.
          + */
          + @Override
          + protected void releaseStartedWorker(RegisteredMesosWorkerNode worker) {
          + releaseWorker(worker.task());
          + }
          +
          + /**
          + * Plan for the removal of the given worker.
          + */
          + private void releaseWorker(MesosWorkerStore.Worker worker) {
          + try {
          + LOG.info("Releasing worker {}", worker.taskID());
          +
          + // update persistent state of worker to Released
          + worker = worker.releaseTask();
          + workerStore.putWorker(worker);
          + workersBeingReturned.put(extractResourceID(worker.taskID()), worker);
          + taskRouter.tell(new TaskMonitor.TaskGoalStateUpdated(extractGoalState(worker)), self());
          +
          + if (worker.hostname().isDefined()) {
          + // tell the launch coordinator that the task is being unassigned from the host, for planning purposes
          + launchCoordinator.tell(new LaunchCoordinator.Unassign(worker.taskID(), worker.hostname().get()), self());
          + }
          + }
          + catch (Exception ex) {
          + fatalError("unable to release worker", ex);
          + }
          + }
          +
          + @Override
          + protected int getNumWorkerRequestsPending() {
          + return workersInNew.size();
          + }
          +
          + @Override
          + protected int getNumWorkersPendingRegistration() {
          + return workersInLaunch.size();
          + }
          +
          + // ------------------------------------------------------------------------
          + // Callbacks from the Mesos Master
          + // ------------------------------------------------------------------------
          +
          + /**
          + * Called when connected to Mesos as a new framework.
          + */
          + private void registered(Registered message) {
          + connectionMonitor.tell(message, self());
          +
          + try {
          + workerStore.setFrameworkID(Option.apply(message.frameworkId()));
          + }
          + catch(Exception ex) {
          + fatalError("unable to store the assigned framework ID", ex);
          + return;
          + }
          +
          + launchCoordinator.tell(message, self());
          + reconciliationCoordinator.tell(message, self());
          + taskRouter.tell(message, self());
          + }
          +
          + /**
          + * Called when reconnected to Mesos following a failover event.
          + */
          + private void reregistered(ReRegistered message) {
          + connectionMonitor.tell(message, self());
          + launchCoordinator.tell(message, self());
          + reconciliationCoordinator.tell(message, self());
          + taskRouter.tell(message, self());
          + }
          +
          + /**
          + * Called when disconnected from Mesos.
          + */
          + private void disconnected(Disconnected message) {
          + connectionMonitor.tell(message, self());
          + launchCoordinator.tell(message, self());
          + reconciliationCoordinator.tell(message, self());
          + taskRouter.tell(message, self());
          + }
          +
          + /**
          + * Called when an error is reported by the scheduler callback.
          + */
          + private void error(String message) {
          + self().tell(new FatalErrorOccurred("Connection to Mesos failed", new Exception(message)), self());
          + }
          +
          + /**
          + * Invoked when a Mesos task reaches a terminal status.
          + */
          + private void taskTerminated(Protos.TaskID taskID, Protos.TaskStatus status) {
          + // this callback occurs for failed containers and for released containers alike
          +
          + final ResourceID id = extractResourceID(taskID);
          +
          + try {
          + workerStore.removeWorker(taskID);
          + }
          + catch(Exception ex) {
          + fatalError("unable to remove worker", ex);
          + return;
          + }
          +
          + // check if this is a failed task or a released task
          + if (workersBeingReturned.remove(id) != null) {
          + // regular finished worker that we released
          + LOG.info("Worker {} finished successfully with diagnostics: {}",
          + id, status.getMessage());
          + } else {
          + // failed worker, either at startup, or running
          + final MesosWorkerStore.Worker launched = workersInLaunch.remove(id);
          + if (launched != null) {
          + LOG.info("Mesos task {} failed, with a TaskManager in launch or registration. " +
          + "State: {} Reason: {} ({})", id, status.getState(), status.getReason(), status.getMessage());
          + // we will trigger re-acquiring new workers at the end
          + } else {
          + // failed registered worker
          + LOG.info("Mesos task {} failed, with a registered TaskManager. " +
          + "State: {} Reason: {} ({})", id, status.getState(), status.getReason(), status.getMessage());
          +
          + // notify the generic logic, which notifies the JobManager, etc.
          + notifyWorkerFailed(id, "Mesos task " + id + " failed. State: " + status.getState());
          + }
          +
          + // general failure logging
          + failedTasksSoFar++;
          +
          + String diagMessage = String.format("Diagnostics for task %s in state %s : " +
          + "reason=%s message=%s",
          + id, status.getState(), status.getReason(), status.getMessage());
          + sendInfoMessage(diagMessage);
          +
          + LOG.info(diagMessage);
          + LOG.info("Total number of failed tasks so far: " + failedTasksSoFar);
          +
          + // maxFailedTasks == -1 is infinite number of retries.
          + if (maxFailedTasks >= 0 && failedTasksSoFar > maxFailedTasks) {
          + String msg = "Stopping Mesos session because the number of failed tasks ("
          + + failedTasksSoFar + ") exceeded the maximum failed tasks ("
          + + maxFailedTasks + "). This number is controlled by the '"
          + + ConfigConstants.MESOS_MAX_FAILED_TASKS + "' configuration setting. "
          + + "By default its the number of requested tasks.";
          +
          + LOG.error(msg);
          + self().tell(decorateMessage(new StopCluster(ApplicationStatus.FAILED, msg)),
          + ActorRef.noSender());
          +
          + // no need to do anything else
          + return;
          + }
          + }
          +
          + // in case failed containers were among the finished containers, make
          + // sure we re-examine and request new ones
          + triggerCheckWorkers();
          + }
          +
          + // ------------------------------------------------------------------------
          + // Utilities
          + // ------------------------------------------------------------------------
          +
          + private LaunchableMesosWorker createLaunchableMesosWorker(Protos.TaskID taskID) {
          + LaunchableMesosWorker launchable =
          + new LaunchableMesosWorker(taskManagerParameters, taskManagerLaunchContext, taskID);
          + return launchable;
          — End diff –

          I hear you, but I often assign important values to a local variable first so that I can set a breakpoint and inspect them before returning.
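
          For illustration only, a minimal Java sketch of the two styles under discussion. `ReturnStyleSketch` and its methods are hypothetical names, not code from the PR. Both methods return the same value; the second gives a debugger a named value to inspect on the line before the method returns.

```java
/** Hypothetical illustration of "return directly" versus "assign, then return". */
class ReturnStyleSketch {

    /** Returns the expression directly; there is no named value to inspect. */
    static String direct(String taskId) {
        return "worker-" + taskId;
    }

    /** Assigns to a local first, so a breakpoint on the return line can show it. */
    static String viaLocal(String taskId) {
        String launchable = "worker-" + taskId;
        return launchable; // a breakpoint here can inspect `launchable`
    }
}
```

          The behavior is the same either way, so the choice is a matter of style and debugging convenience.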

ConnectionMonitor.Start(), self()); + schedulerDriver.start(); + } + + protected ActorRef createConnectionMonitor() { + return context().actorOf( + ConnectionMonitor.createActorProps(ConnectionMonitor.class, config), + "connectionMonitor"); + } + + protected ActorRef createTaskRouter() { + return context().actorOf( + Tasks.createActorProps(Tasks.class, config, schedulerDriver, TaskMonitor.class), + "tasks"); + } + + protected ActorRef createLaunchCoordinator() { + return context().actorOf( + LaunchCoordinator.createActorProps(LaunchCoordinator.class, self(), config, schedulerDriver, createOptimizer()), + "launchCoordinator"); + } + + protected ActorRef createReconciliationCoordinator() { + return context().actorOf( + ReconciliationCoordinator.createActorProps(ReconciliationCoordinator.class, config, schedulerDriver), + "reconciliationCoordinator"); + } + + @Override + public void postStop() { + LOG.info("Stopping Mesos resource master"); + super.postStop(); + } + + // ------------------------------------------------------------------------ + // Actor messages + // ------------------------------------------------------------------------ + + @Override + protected void handleMessage(Object message) { + + // check for Mesos-specific actor messages first + + // — messages about Mesos connection + if (message instanceof Registered) { + registered((Registered) message); + } else if (message instanceof ReRegistered) { + reregistered((ReRegistered) message); + } else if (message instanceof Disconnected) { + disconnected((Disconnected) message); + } else if (message instanceof Error) { + error(((Error) message).message()); + + // --- messages about offers + } else if (message instanceof ResourceOffers || message instanceof OfferRescinded) { + launchCoordinator.tell(message, self()); + } else if (message instanceof AcceptOffers) { + acceptOffers((AcceptOffers) message); + + // --- messages about tasks + } else if (message instanceof StatusUpdate) { + 
taskStatusUpdated((StatusUpdate) message); + } else if (message instanceof ReconciliationCoordinator.Reconcile) { + // a reconciliation request from a task + reconciliationCoordinator.tell(message, self()); + } else if (message instanceof TaskMonitor.TaskTerminated) { + // a termination message from a task + TaskMonitor.TaskTerminated msg = (TaskMonitor.TaskTerminated) message; + taskTerminated(msg.taskID(), msg.status()); + + } else { + // message handled by the generic resource master code + super.handleMessage(message); + } + } + + /** + * Called to shut down the cluster (not a failover situation). + * + * @param finalStatus The application status to report. + * @param optionalDiagnostics An optional diagnostics message. + */ + @Override + protected void shutdownApplication(ApplicationStatus finalStatus, String optionalDiagnostics) { + + LOG.info("Shutting down and unregistering as a Mesos framework."); + try { + // unregister the framework, which implicitly removes all tasks. + schedulerDriver.stop(false); + } + catch(Exception ex) { + LOG.warn("unable to unregister the framework", ex); + } + + try { + workerStore.cleanup(); + } + catch(Exception ex) { + LOG.warn("unable to cleanup the ZooKeeper state", ex); + } + + context().stop(self()); + } + + @Override + protected void fatalError(String message, Throwable error) { + // we do not unregister, but cause a hard fail of this process, to have it + // restarted by the dispatcher + LOG.error("FATAL ERROR IN MESOS APPLICATION MASTER: " + message, error); + LOG.error("Shutting down process"); + + // kill this process, this will make an external supervisor (the dispatcher) restart the process + System.exit(EXIT_CODE_FATAL_ERROR); + } + + // ------------------------------------------------------------------------ + // Worker Management + // ------------------------------------------------------------------------ + + /** + * Recover framework/worker information persisted by a prior incarnation of the RM. 
+ */ + private void recoverWorkers() throws Exception { + // if this application master starts as part of an ApplicationMaster/JobManager recovery, + // then some worker tasks are most likely still alive and we can re-obtain them + final List<MesosWorkerStore.Worker> tasksFromPreviousAttempts = workerStore.recoverWorkers(); + + if (!tasksFromPreviousAttempts.isEmpty()) { + LOG.info("Retrieved {} TaskManagers from previous attempt", tasksFromPreviousAttempts.size()); + + List<Tuple2<TaskRequest,String>> toAssign = new ArrayList<>(tasksFromPreviousAttempts.size()); + List<LaunchableTask> toLaunch = new ArrayList<>(tasksFromPreviousAttempts.size()); + + for (final MesosWorkerStore.Worker worker : tasksFromPreviousAttempts) { + LaunchableMesosWorker launchable = createLaunchableMesosWorker(worker.taskID()); + + switch(worker.state()) { + case New: + workersInNew.put(extractResourceID(worker.taskID()), worker); + toLaunch.add(launchable); + break; + case Launched: + workersInLaunch.put(extractResourceID(worker.taskID()), worker); + toAssign.add(new Tuple2<>(launchable.taskRequest(), worker.hostname().get())); + break; + case Released: + workersBeingReturned.put(extractResourceID(worker.taskID()), worker); + break; + } + taskRouter.tell(new TaskMonitor.TaskGoalStateUpdated(extractGoalState(worker)), self()); + } + + // tell the launch coordinator about prior assignments + if(toAssign.size() >= 1) { + launchCoordinator.tell(new LaunchCoordinator.Assign(toAssign), self()); + } + // tell the launch coordinator to launch any new tasks + if(toLaunch.size() >= 1) { + launchCoordinator.tell(new LaunchCoordinator.Launch(toLaunch), self()); + } + } + } + + /** + * Plan for some additional workers to be launched. + * + * @param numWorkers The number of workers to allocate. 
+ */ + @Override + protected void requestNewWorkers(int numWorkers) { + + try { + List<TaskMonitor.TaskGoalStateUpdated> toMonitor = new ArrayList<>(numWorkers); + List<LaunchableTask> toLaunch = new ArrayList<>(numWorkers); + + // generate new workers into persistent state and launch associated actors + for (int i = 0; i < numWorkers; i++) { + MesosWorkerStore.Worker worker = MesosWorkerStore.Worker.newTask(workerStore.newTaskID()); + workerStore.putWorker(worker); + workersInNew.put(extractResourceID(worker.taskID()), worker); + + LaunchableMesosWorker launchable = createLaunchableMesosWorker(worker.taskID()); + + LOG.info("Scheduling Mesos task {} with ({} MB, {} cpus).", + launchable.taskID().getValue(), launchable.taskRequest().getMemory(), launchable.taskRequest().getCPUs()); + + toMonitor.add(new TaskMonitor.TaskGoalStateUpdated(extractGoalState(worker))); + toLaunch.add(launchable); + } + + // tell the task router about the new plans + for (TaskMonitor.TaskGoalStateUpdated update : toMonitor) { + taskRouter.tell(update, self()); + } + + // tell the launch coordinator to launch the new tasks + if(toLaunch.size() >= 1) { + launchCoordinator.tell(new LaunchCoordinator.Launch(toLaunch), self()); + } + } + catch(Exception ex) { + fatalError("unable to request new workers", ex); + } + } + + /** + * Accept offers as advised by the launch coordinator. + * + * Acceptance is routed through the RM to update the persistent state before + * forwarding the message to Mesos. 
+ */ + private void acceptOffers(AcceptOffers msg) { + + try { + List<TaskMonitor.TaskGoalStateUpdated> toMonitor = new ArrayList<>(msg.operations().size()); + + // transition the persistent state of some tasks to Launched + for (Protos.Offer.Operation op : msg.operations()) { + if (op.getType() != Protos.Offer.Operation.Type.LAUNCH) { + continue; + } + for (Protos.TaskInfo info : op.getLaunch().getTaskInfosList()) { + MesosWorkerStore.Worker worker = workersInNew.remove(extractResourceID(info.getTaskId())); + assert (worker != null); + + worker = worker.launchTask(info.getSlaveId(), msg.hostname()); + workerStore.putWorker(worker); + workersInLaunch.put(extractResourceID(worker.taskID()), worker); + + LOG.info("Launching Mesos task {} on host {}.", + worker.taskID().getValue(), worker.hostname().get()); + + toMonitor.add(new TaskMonitor.TaskGoalStateUpdated(extractGoalState(worker))); + } + } + + // tell the task router about the new plans + for (TaskMonitor.TaskGoalStateUpdated update : toMonitor) { + taskRouter.tell(update, self()); + } + + // send the acceptance message to Mesos + schedulerDriver.acceptOffers(msg.offerIds(), msg.operations(), msg.filters()); + } + catch(Exception ex) { + fatalError("unable to accept offers", ex); + } + } + + /** + * Handle a task status change. + */ + private void taskStatusUpdated(StatusUpdate message) { + taskRouter.tell(message, self()); + reconciliationCoordinator.tell(message, self()); + schedulerDriver.acknowledgeStatusUpdate(message.status()); + } + + /** + * Accept the given started worker into the internal state. + * + * @param resourceID The worker resource id + * @return A registered worker node record. 
+ */ + @Override + protected RegisteredMesosWorkerNode workerStarted(ResourceID resourceID) { + MesosWorkerStore.Worker inLaunch = workersInLaunch.remove(resourceID); + if (inLaunch == null) { + // Worker was not in state "being launched", this can indicate that the TaskManager + // in this worker was already registered or that the container was not started + // by this resource manager. Simply ignore this resourceID. + return null; + } + return new RegisteredMesosWorkerNode(inLaunch); + } + + /** + * Accept the given registered workers into the internal state. + * + * @param toConsolidate The worker IDs known previously to the JobManager. + * @return A collection of registered worker node records. + */ + @Override + protected Collection<RegisteredMesosWorkerNode> reacceptRegisteredWorkers(Collection<ResourceID> toConsolidate) { + + // we check for each task manager if we recognize its Mesos task ID + List<RegisteredMesosWorkerNode> accepted = new ArrayList<>(toConsolidate.size()); + for (ResourceID resourceID : toConsolidate) { + MesosWorkerStore.Worker worker = workersInLaunch.remove(resourceID); + if (worker != null) { + LOG.info("Mesos worker consolidation recognizes TaskManager {}.", resourceID); + accepted.add(new RegisteredMesosWorkerNode(worker)); + } + else { + if(isStarted(resourceID)) { + LOG.info("TaskManager {} has already been registered at the resource manager.", resourceID); + } + else { + LOG.info("Mesos worker consolidation does not recognize TaskManager {}.", resourceID); + } + } + } + return accepted; + } + + /** + * Release the given pending worker. + */ + @Override + protected void releasePendingWorker(ResourceID id) { + MesosWorkerStore.Worker worker = workersInLaunch.remove(id); + if (worker != null) { + releaseWorker(worker); + } else { + LOG.error("Cannot find worker {} to release. Ignoring request.", id); + } + } + + /** + * Release the given started worker. 
+ */ + @Override + protected void releaseStartedWorker(RegisteredMesosWorkerNode worker) { + releaseWorker(worker.task()); + } + + /** + * Plan for the removal of the given worker. + */ + private void releaseWorker(MesosWorkerStore.Worker worker) { + try { + LOG.info("Releasing worker {}", worker.taskID()); + + // update persistent state of worker to Released + worker = worker.releaseTask(); + workerStore.putWorker(worker); + workersBeingReturned.put(extractResourceID(worker.taskID()), worker); + taskRouter.tell(new TaskMonitor.TaskGoalStateUpdated(extractGoalState(worker)), self()); + + if (worker.hostname().isDefined()) { + // tell the launch coordinator that the task is being unassigned from the host, for planning purposes + launchCoordinator.tell(new LaunchCoordinator.Unassign(worker.taskID(), worker.hostname().get()), self()); + } + } + catch (Exception ex) { + fatalError("unable to release worker", ex); + } + } + + @Override + protected int getNumWorkerRequestsPending() { + return workersInNew.size(); + } + + @Override + protected int getNumWorkersPendingRegistration() { + return workersInLaunch.size(); + } + + // ------------------------------------------------------------------------ + // Callbacks from the Mesos Master + // ------------------------------------------------------------------------ + + /** + * Called when connected to Mesos as a new framework. + */ + private void registered(Registered message) { + connectionMonitor.tell(message, self()); + + try { + workerStore.setFrameworkID(Option.apply(message.frameworkId())); + } + catch(Exception ex) { + fatalError("unable to store the assigned framework ID", ex); + return; + } + + launchCoordinator.tell(message, self()); + reconciliationCoordinator.tell(message, self()); + taskRouter.tell(message, self()); + } + + /** + * Called when reconnected to Mesos following a failover event. 
+ */ + private void reregistered(ReRegistered message) { + connectionMonitor.tell(message, self()); + launchCoordinator.tell(message, self()); + reconciliationCoordinator.tell(message, self()); + taskRouter.tell(message, self()); + } + + /** + * Called when disconnected from Mesos. + */ + private void disconnected(Disconnected message) { + connectionMonitor.tell(message, self()); + launchCoordinator.tell(message, self()); + reconciliationCoordinator.tell(message, self()); + taskRouter.tell(message, self()); + } + + /** + * Called when an error is reported by the scheduler callback. + */ + private void error(String message) { + self().tell(new FatalErrorOccurred("Connection to Mesos failed", new Exception(message)), self()); + } + + /** + * Invoked when a Mesos task reaches a terminal status. + */ + private void taskTerminated(Protos.TaskID taskID, Protos.TaskStatus status) { + // this callback occurs for failed containers and for released containers alike + + final ResourceID id = extractResourceID(taskID); + + try { + workerStore.removeWorker(taskID); + } + catch(Exception ex) { + fatalError("unable to remove worker", ex); + return; + } + + // check if this is a failed task or a released task + if (workersBeingReturned.remove(id) != null) { + // regular finished worker that we released + LOG.info("Worker {} finished successfully with diagnostics: {}", + id, status.getMessage()); + } else { + // failed worker, either at startup, or running + final MesosWorkerStore.Worker launched = workersInLaunch.remove(id); + if (launched != null) { + LOG.info("Mesos task {} failed, with a TaskManager in launch or registration. " + + "State: {} Reason: {} ({})", id, status.getState(), status.getReason(), status.getMessage()); + // we will trigger re-acquiring new workers at the end + } else { + // failed registered worker + LOG.info("Mesos task {} failed, with a registered TaskManager. 
" + + "State: {} Reason: {} ({})", id, status.getState(), status.getReason(), status.getMessage()); + + // notify the generic logic, which notifies the JobManager, etc. + notifyWorkerFailed(id, "Mesos task " + id + " failed. State: " + status.getState()); + } + + // general failure logging + failedTasksSoFar++; + + String diagMessage = String.format("Diagnostics for task %s in state %s : " + + "reason=%s message=%s", + id, status.getState(), status.getReason(), status.getMessage()); + sendInfoMessage(diagMessage); + + LOG.info(diagMessage); + LOG.info("Total number of failed tasks so far: " + failedTasksSoFar); + + // maxFailedTasks == -1 is infinite number of retries. + if (maxFailedTasks >= 0 && failedTasksSoFar > maxFailedTasks) { + String msg = "Stopping Mesos session because the number of failed tasks (" + + failedTasksSoFar + ") exceeded the maximum failed tasks (" + + maxFailedTasks + "). This number is controlled by the '" + + ConfigConstants.MESOS_MAX_FAILED_TASKS + "' configuration setting. " + + "By default its the number of requested tasks."; + + LOG.error(msg); + self().tell(decorateMessage(new StopCluster(ApplicationStatus.FAILED, msg)), + ActorRef.noSender()); + + // no need to do anything else + return; + } + } + + // in case failed containers were among the finished containers, make + // sure we re-examine and request new ones + triggerCheckWorkers(); + } + + // ------------------------------------------------------------------------ + // Utilities + // ------------------------------------------------------------------------ + + private LaunchableMesosWorker createLaunchableMesosWorker(Protos.TaskID taskID) { + LaunchableMesosWorker launchable = + new LaunchableMesosWorker(taskManagerParameters, taskManagerLaunchContext, taskID); + return launchable; — End diff – I hear you but I often assign important data items a local variable first to be able to set a breakpoint for examination before returning.
          githubbot ASF GitHub Bot added a comment -

          Github user EronWright commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75325271

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/MesosConfigKeys.java —
          @@ -0,0 +1,44 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.runtime.clusterframework;
          +
          +/**
          + * The Mesos environment variables used for settings of the containers.
          + */
          +public class MesosConfigKeys {
          — End diff –

          I considered that but chose to minimize the impact on the YARN code to avoid merge conflicts for such a large PR. How about we tackle this in a follow-up?

          githubbot ASF GitHub Bot added a comment -

          Github user EronWright commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75323135

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/scheduler/SchedulerProxy.java —
          @@ -0,0 +1,105 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.scheduler;
          +
          +import akka.actor.ActorRef;
          +
          +import org.apache.flink.mesos.scheduler.messages.Disconnected;
          +import org.apache.flink.mesos.scheduler.messages.Error;
          +import org.apache.flink.mesos.scheduler.messages.Error;
          +import org.apache.flink.mesos.scheduler.messages.OfferRescinded;
          +import org.apache.flink.mesos.scheduler.messages.ReRegistered;
          +import org.apache.flink.mesos.scheduler.messages.Registered;
          +import org.apache.flink.mesos.scheduler.messages.ResourceOffers;
          +import org.apache.flink.mesos.scheduler.messages.SlaveLost;
          +import org.apache.flink.mesos.scheduler.messages.StatusUpdate;
          +import org.apache.mesos.Protos;
          +import org.apache.mesos.Scheduler;
          +import org.apache.mesos.SchedulerDriver;
          +
          +import java.util.List;
          +
          +/**
          + * This class reacts to callbacks from the Mesos scheduler driver.
          + *
          + * In order to preserve actor concurrency safety, this class simply sends
          + * corresponding messages to the Mesos resource master actor.
          + *
          + * See https://mesos.apache.org/api/latest/java/org/apache/mesos/Scheduler.html
          + */
          +public class SchedulerProxy implements Scheduler {
          +
          + /** The actor to which we report the callbacks */
          + private ActorRef mesosActor;
          — End diff –

          @mxm it is generic because both the RM and the Dispatcher use the code in the `scheduler` package. I'm guessing you were thinking the field's name could be more concrete.
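
For readers following the thread, here is a minimal sketch of the proxy idea described in the Javadoc above: driver callbacks arrive on a foreign thread, and the proxy only turns them into messages for a single target, so the same proxy can serve any recipient (RM or Dispatcher). Everything in it is hypothetical and for illustration only: `Callbacks` is a tiny stand-in for `org.apache.mesos.Scheduler`, `Mailbox` stands in for an Akka `ActorRef`, and the message classes are simplified. It is not the code from the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch of a callback-to-message proxy; not the Flink or Mesos API. */
public class ProxySketch {

    /** Stand-in for the driver's callback interface (assumption: far smaller than the real one). */
    interface Callbacks {
        void registered(String frameworkId);
        void disconnected();
        void error(String message);
    }

    /** Stand-in for an actor reference: a thread-safe mailbox that one consumer drains. */
    static final class Mailbox {
        private final BlockingQueue<Object> queue = new LinkedBlockingQueue<>();

        void tell(Object message) {
            queue.add(message);
        }

        List<Object> drain() {
            List<Object> out = new ArrayList<>();
            queue.drainTo(out);
            return out;
        }
    }

    /** Simplified message types, one per callback. */
    static final class Registered {
        final String frameworkId;
        Registered(String frameworkId) { this.frameworkId = frameworkId; }
        @Override public String toString() { return "Registered(" + frameworkId + ")"; }
    }

    static final class Disconnected {
        @Override public String toString() { return "Disconnected"; }
    }

    static final class Failed {
        final String message;
        Failed(String message) { this.message = message; }
        @Override public String toString() { return "Failed(" + message + ")"; }
    }

    /** The proxy holds no state of its own; it only forwards each callback as a message. */
    static final class Proxy implements Callbacks {
        private final Mailbox target;

        Proxy(Mailbox target) {
            this.target = Objects.requireNonNull(target);
        }

        @Override public void registered(String frameworkId) { target.tell(new Registered(frameworkId)); }
        @Override public void disconnected() { target.tell(new Disconnected()); }
        @Override public void error(String message) { target.tell(new Failed(message)); }
    }

    /** Fires three callbacks and returns what the target mailbox received, in order. */
    public static List<String> demo() {
        Mailbox target = new Mailbox();
        Callbacks proxy = new Proxy(target);
        proxy.registered("fw-1");
        proxy.disconnected();
        proxy.error("boom");

        List<String> seen = new ArrayList<>();
        for (Object message : target.drain()) {
            seen.add(message.toString());
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the proxy depends only on "something that accepts messages", the field naming question raised in the review is about readability rather than behavior.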

          githubbot ASF GitHub Bot added a comment -

          Github user EronWright commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75319776

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/MesosFlinkResourceManager.java —
          @@ -0,0 +1,755 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.runtime.clusterframework;
          +
          +import akka.actor.ActorRef;
          +import akka.actor.Props;
          +import com.netflix.fenzo.TaskRequest;
          +import com.netflix.fenzo.TaskScheduler;
          +import com.netflix.fenzo.VirtualMachineLease;
          +import com.netflix.fenzo.functions.Action1;
          +import org.apache.flink.api.java.tuple.Tuple2;
          +import org.apache.flink.configuration.ConfigConstants;
          +import org.apache.flink.configuration.Configuration;
          +import org.apache.flink.mesos.runtime.clusterframework.store.MesosWorkerStore;
          +import org.apache.flink.mesos.scheduler.ConnectionMonitor;
          +import org.apache.flink.mesos.scheduler.LaunchableTask;
          +import org.apache.flink.mesos.scheduler.LaunchCoordinator;
          +import org.apache.flink.mesos.scheduler.ReconciliationCoordinator;
          +import org.apache.flink.mesos.scheduler.SchedulerProxy;
          +import org.apache.flink.mesos.scheduler.TaskMonitor;
          +import org.apache.flink.mesos.scheduler.TaskSchedulerBuilder;
          +import org.apache.flink.mesos.scheduler.Tasks;
          +import org.apache.flink.mesos.scheduler.messages.AcceptOffers;
          +import org.apache.flink.mesos.scheduler.messages.Disconnected;
          +import org.apache.flink.mesos.scheduler.messages.Error;
          +import org.apache.flink.mesos.scheduler.messages.OfferRescinded;
          +import org.apache.flink.mesos.scheduler.messages.ReRegistered;
          +import org.apache.flink.mesos.scheduler.messages.Registered;
          +import org.apache.flink.mesos.scheduler.messages.ResourceOffers;
          +import org.apache.flink.mesos.scheduler.messages.StatusUpdate;
          +import org.apache.flink.mesos.util.MesosConfiguration;
          +import org.apache.flink.runtime.clusterframework.ApplicationStatus;
          +import org.apache.flink.runtime.clusterframework.FlinkResourceManager;
          +import org.apache.flink.runtime.clusterframework.messages.FatalErrorOccurred;
          +import org.apache.flink.runtime.clusterframework.messages.StopCluster;
          +import org.apache.flink.runtime.clusterframework.types.ResourceID;
          +import org.apache.flink.runtime.leaderretrieval.LeaderRetrievalService;
          +import org.apache.mesos.Protos;
          +import org.apache.mesos.Protos.FrameworkInfo;
          +import org.apache.mesos.SchedulerDriver;
          +import org.slf4j.Logger;
          +import scala.Option;
          +
          +import java.util.ArrayList;
          +import java.util.Collection;
          +import java.util.HashMap;
          +import java.util.List;
          +import java.util.Map;
          +
          +import static java.util.Objects.requireNonNull;
          +
          +/**
          + * Flink Resource Manager for Apache Mesos.
          + */
          +public class MesosFlinkResourceManager extends FlinkResourceManager<RegisteredMesosWorkerNode> {
          +
          + /** The Mesos configuration (master and framework info) */
          + private final MesosConfiguration mesosConfig;
          +
          + /** The TaskManager container parameters (like container memory size) */
          + private final MesosTaskManagerParameters taskManagerParameters;
          +
          + /** Context information used to start a TaskManager Java process */
          + private final Protos.TaskInfo.Builder taskManagerLaunchContext;
          +
          + /** Number of failed Mesos tasks before stopping the application. -1 means infinite. */
          + private final int maxFailedTasks;
          +
          + /** Callback handler for the asynchronous Mesos scheduler */
          + private SchedulerProxy schedulerCallbackHandler;
          +
          + /** Mesos scheduler driver */
          + private SchedulerDriver schedulerDriver;
          +
          + private ActorRef connectionMonitor;
          +
          + private ActorRef taskRouter;
          +
          + private ActorRef launchCoordinator;
          +
          + private ActorRef reconciliationCoordinator;
          +
          + private MesosWorkerStore workerStore;
          +
          + final Map<ResourceID, MesosWorkerStore.Worker> workersInNew;
          + final Map<ResourceID, MesosWorkerStore.Worker> workersInLaunch;
          + final Map<ResourceID, MesosWorkerStore.Worker> workersBeingReturned;
          +
          + /** The number of failed tasks since the master became active */
          + private int failedTasksSoFar;
          +
          + public MesosFlinkResourceManager(
          + Configuration flinkConfig,
          + MesosConfiguration mesosConfig,
          + MesosWorkerStore workerStore,
          + LeaderRetrievalService leaderRetrievalService,
          + MesosTaskManagerParameters taskManagerParameters,
          + Protos.TaskInfo.Builder taskManagerLaunchContext,
          + int maxFailedTasks,
          + int numInitialTaskManagers) {
          +
          + super(numInitialTaskManagers, flinkConfig, leaderRetrievalService);
          +
          + this.mesosConfig = requireNonNull(mesosConfig);
          — End diff –

          No thanks, the other RM code uses `requireNonNull`.
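
          For readers unfamiliar with the convention mentioned here: `requireNonNull` is the JDK's `java.util.Objects.requireNonNull`, statically imported in the diff above. Below is a minimal sketch of the pattern. The `WorkerSettings` class and its fields are hypothetical and are not part of the PR or of Flink.

```java
import static java.util.Objects.requireNonNull;

/** Hypothetical class for illustration only; not part of the Flink code base. */
public class WorkerSettings {

    private final String masterUrl;
    private final int maxFailedTasks;

    public WorkerSettings(String masterUrl, int maxFailedTasks) {
        // Fails fast with a NullPointerException whose message names the argument.
        this.masterUrl = requireNonNull(masterUrl, "masterUrl");
        this.maxFailedTasks = maxFailedTasks;
    }

    public String getMasterUrl() {
        return masterUrl;
    }

    public int getMaxFailedTasks() {
        return maxFailedTasks;
    }

    public static void main(String[] args) {
        WorkerSettings ok = new WorkerSettings("zk://host:2181/mesos", -1);
        System.out.println(ok.getMasterUrl());
        try {
            new WorkerSettings(null, -1);
        } catch (NullPointerException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

          Validating in the constructor means a missing argument surfaces where the object is built rather than at a later, unrelated call site.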

          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user EronWright commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75318417

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/cli/FlinkMesosSessionCli.java —
          @@ -0,0 +1,59 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.cli;
          +
          +import com.fasterxml.jackson.core.JsonProcessingException;
          +import com.fasterxml.jackson.core.type.TypeReference;
          +import com.fasterxml.jackson.databind.ObjectMapper;
          +import org.apache.flink.configuration.Configuration;
          +
          +import java.io.IOException;
          +import java.util.Map;
          +
          +public class FlinkMesosSessionCli {
          — End diff –

          @mxm you probably recognize this class as an analogue to `FlinkYarnSessionCli`. I placed the helper method here for consistency, but the actual CLI work won't be implemented until later this month. I'll write some comments, but let us have an understanding that this file is a work in progress.

          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on the issue:

          https://github.com/apache/flink/pull/2315

          Thank you for your work, @EronWright. I finally had a chance to go through the code. All in all, this is very impressive as the first step of the Mesos integration! I think this PR is in a mergeable state once some minor comments are addressed.

          I'm not 100% sure about all the additional actors yet. It seems like `ReconciliationCoordinator`, `ConnectionMonitor`, and `TaskMonitor` could also easily be handled inside `MesosFlinkResourceManager`. In terms of modularity, I can see that having these run independently can give us a more flexible setup. Which of the actors do you plan to reuse in the next set of changes? Clearly, in terms of testability, the separation comes in really handy.
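
          To illustrate the testability point, here is a minimal, Akka-free sketch in plain Java. It is not the PR's code: `ConnectionTracker`, `Listener`, and `State` are hypothetical names, and the real `ConnectionMonitor` is an actor. The idea is only that a small component owning one concern can be exercised without constructing the whole resource manager.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, Akka-free illustration; none of these types exist in the PR. */
public class ConnectionTrackerSketch {

    enum State { DISCONNECTED, CONNECTED }

    /** Receives notifications; a resource manager could implement this. */
    interface Listener {
        void onStateChanged(State newState);
    }

    /** Small component that owns only the connection state. */
    static final class ConnectionTracker {
        private final Listener listener;
        private State state = State.DISCONNECTED;

        ConnectionTracker(Listener listener) {
            this.listener = listener;
        }

        void registered() {
            transitionTo(State.CONNECTED);
        }

        void disconnected() {
            transitionTo(State.DISCONNECTED);
        }

        State state() {
            return state;
        }

        private void transitionTo(State next) {
            // Notify the listener only when the state actually changes.
            if (next != state) {
                state = next;
                listener.onStateChanged(next);
            }
        }
    }

    public static void main(String[] args) {
        List<State> seen = new ArrayList<>();
        ConnectionTracker tracker = new ConnectionTracker(seen::add);
        tracker.registered();
        tracker.registered();   // duplicate event, no second notification
        tracker.disconnected();
        System.out.println(seen);
    }
}
```

          The cost of this separation is the extra indirection between components, which is the trade-off raised in the comment above.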

          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75303392

          — Diff: flink-mesos/src/test/scala/org/apache/flink/mesos/scheduler/LaunchCoordinatorTest.scala —
          @@ -0,0 +1,439 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.scheduler
          +
          +import java.util.{Collections, UUID}
          +import java.util.concurrent.atomic.AtomicReference
          +
          +import akka.actor.FSM.StateTimeout
          +import akka.testkit._
          +import com.netflix.fenzo.TaskRequest.{AssignedResources, NamedResourceSetRequest}
          +import com.netflix.fenzo._
          +import com.netflix.fenzo.functions.{Action1, Action2}
          +import com.netflix.fenzo.plugins.VMLeaseObject
          +import org.apache.flink.api.java.tuple.{Tuple2=>FlinkTuple2}
          +import org.apache.flink.configuration.Configuration
          +import org.apache.flink.mesos.scheduler.LaunchCoordinator._
          +import org.apache.flink.mesos.scheduler.messages._
          +import org.apache.flink.runtime.akka.AkkaUtils
          +import org.apache.mesos.Protos.{SlaveID, TaskInfo}
          +import org.apache.mesos.{SchedulerDriver, Protos}
          +import org.junit.runner.RunWith
          +import org.mockito.Mockito.{verify, _}
          +import org.mockito.invocation.InvocationOnMock
          +import org.mockito.stubbing.Answer
          +import org.mockito.{Matchers => MM, Mockito}
          +import org.scalatest.junit.JUnitRunner
          +import org.scalatest.{BeforeAndAfterAll, Matchers, WordSpecLike}
          +
          +import scala.collection.JavaConverters._
          +
          +import org.apache.flink.mesos.Utils.range
          +import org.apache.flink.mesos.Utils.ranges
          +import org.apache.flink.mesos.Utils.scalar
          +
          +@RunWith(classOf[JUnitRunner])
          +class LaunchCoordinatorTest
          + extends TestKitBase
          + with ImplicitSender
          + with WordSpecLike
          + with Matchers
          + with BeforeAndAfterAll {
          +
          + lazy val config = new Configuration()
          + implicit lazy val system = AkkaUtils.createLocalActorSystem(config)
          +
          + override def afterAll(): Unit = {
          + TestKit.shutdownActorSystem(system)
          + }
          +
          + def randomFramework = {
          + Protos.FrameworkID.newBuilder().setValue(UUID.randomUUID.toString).build
          + }
          +
          + def randomTask = {
          + val taskID = Protos.TaskID.newBuilder.setValue(UUID.randomUUID.toString).build
          +
          + def generateTaskRequest = {
          + new TaskRequest() {
          + private[mesos] val assignedResources = new AtomicReference[TaskRequest.AssignedResources]
          + override def getId: String = taskID.getValue
          + override def taskGroupName: String = ""
          + override def getCPUs: Double = 1.0
          + override def getMemory: Double = 1024.0
          + override def getNetworkMbps: Double = 0.0
          + override def getDisk: Double = 0.0
          + override def getPorts: Int = 1
          + override def getCustomNamedResources: java.util.Map[String, NamedResourceSetRequest] =
          + Collections.emptyMap[String, NamedResourceSetRequest]
          + override def getSoftConstraints: java.util.List[_ <: VMTaskFitnessCalculator] = null
          + override def getHardConstraints: java.util.List[_ <: ConstraintEvaluator] = null
          + override def getAssignedResources: AssignedResources = assignedResources.get()
          + override def setAssignedResources(assignedResources: AssignedResources): Unit = {
          + this.assignedResources.set(assignedResources)
          + }
          + }
          + }
          +
          + val task: LaunchableTask = new LaunchableTask() {
          + override def taskRequest: TaskRequest = generateTaskRequest
          + override def launch(slaveId: SlaveID, taskAssignment: TaskAssignmentResult): Protos.TaskInfo = {
          + Protos.TaskInfo.newBuilder
          + .setTaskId(taskID).setName(taskID.getValue)
          + .setCommand(Protos.CommandInfo.newBuilder.setValue("whoami"))
          + .setSlaveId(slaveId)
          + .build()
          + }
          + override def toString = taskRequest.getId
          + }
          +
          + (taskID, task)
          + }
          +
          + def randomSlave = {
          + val slaveID = Protos.SlaveID.newBuilder.setValue(UUID.randomUUID.toString).build
          + val hostname = s"host-${slaveID.getValue}"
          + (slaveID, hostname)
          + }
          +
          + def randomOffer(frameworkID: Protos.FrameworkID, slave: (Protos.SlaveID, String)) = {
          + val offerID = Protos.OfferID.newBuilder().setValue(UUID.randomUUID.toString)
          + Protos.Offer.newBuilder()
          + .setFrameworkId(frameworkID)
          + .setId(offerID)
          + .setSlaveId(slave._1)
          + .setHostname(slave._2)
          + .addResources(scalar("cpus", 0.75))
          + .addResources(scalar("mem", 4096.0))
          + .addResources(scalar("disk", 1024.0))
          + .addResources(ranges("ports", range(9000, 9001)))
          + .build()
          + }
          +
          + def lease(offer: Protos.Offer) = {
          + new VMLeaseObject(offer)
          + }
          +
          + /**
          + * Mock a successful task assignment result matching a task to an offer.
          + */
          + def taskAssignmentResult(lease: VirtualMachineLease, task: TaskRequest): TaskAssignmentResult = {
          + val ports = lease.portRanges().get(0)
          + val r = mock(classOf[TaskAssignmentResult])
          + when(r.getTaskId).thenReturn(task.getId)
          + when(r.getHostname).thenReturn(lease.hostname())
          + when(r.getAssignedPorts).thenReturn(
          + (ports.getBeg to ports.getBeg + task.getPorts).toList.asJava.asInstanceOf[java.util.List[Integer]])
          + when(r.getRequest).thenReturn(task)
          + when(r.isSuccessful).thenReturn(true)
          + when(r.getFitness).thenReturn(1.0)
          + r
          + }
          +
          + /**
          + * Mock a VM assignment result with the given leases and tasks.
          + */
          + def vmAssignmentResult(hostname: String,
          + leasesUsed: Seq[VirtualMachineLease],
          + tasksAssigned: Set[TaskAssignmentResult]): VMAssignmentResult = {
          + new VMAssignmentResult(hostname, leasesUsed.asJava, tasksAssigned.asJava)
          + }
          +
          + /**
          + * Mock a scheduling result with the given successes and failures.
          + */
          + def schedulingResult(successes: Seq[VMAssignmentResult],
          + failures: Seq[TaskAssignmentResult] = Nil,
          + exceptions: Seq[Exception] = Nil,
          + leasesAdded: Int = 0,
          + leasesRejected: Int = 0): SchedulingResult = {
          + val r = mock(classOf[SchedulingResult])
          + when(r.getResultMap).thenReturn(successes.map(r => r.getHostname -> r).toMap.asJava)
          + when(r.getExceptions).thenReturn(exceptions.asJava)
          + val groupedFailures = failures.groupBy(_.getRequest).mapValues(_.asJava)
          + when(r.getFailures).thenReturn(groupedFailures.asJava)
          + when(r.getLeasesAdded).thenReturn(leasesAdded)
          + when(r.getLeasesRejected).thenReturn(leasesRejected)
          + when(r.getRuntime).thenReturn(0)
          + when(r.getNumAllocations).thenThrow(new NotImplementedError())
          + when(r.getTotalVMsCount).thenThrow(new NotImplementedError())
          + when(r.getIdleVMsCount).thenThrow(new NotImplementedError())
          + r
          + }
          +
          +
          + /**
          + * Mock a task scheduler.
          + * The task assigner/unassigner is pre-wired.
          + */
          + def taskScheduler() = {
          + val optimizer = mock(classOf[TaskScheduler])
          + val taskAssigner = mock(classOf[Action2[TaskRequest, String]])
          + when[Action2[TaskRequest, String]](optimizer.getTaskAssigner).thenReturn(taskAssigner)
          + val taskUnassigner = mock(classOf[Action2[String, String]])
          + when[Action2[String, String]](optimizer.getTaskUnAssigner).thenReturn(taskUnassigner)
          + optimizer
          + }
          +
          + /**
          + * Create a task scheduler builder.
          + */
          + def taskSchedulerBuilder(optimizer: TaskScheduler) = new TaskSchedulerBuilder {
          + var leaseRejectAction: Action1[VirtualMachineLease] = null
          + override def withLeaseRejectAction(action: Action1[VirtualMachineLease]): TaskSchedulerBuilder = {
          + leaseRejectAction = action
          + this
          + }
          + override def build(): TaskScheduler = optimizer
          + }
          +
          + /**
          + * Process a call to scheduleOnce with the given function.
          + */
          + def scheduleOnce(f: (Seq[TaskRequest],Seq[VirtualMachineLease]) => SchedulingResult) = {
          + new Answer[SchedulingResult] {
          + override def answer(invocationOnMock: InvocationOnMock): SchedulingResult = {
          + val args = invocationOnMock.getArguments
          + val requests = args(0).asInstanceOf[java.util.List[TaskRequest]]
          + val newLeases = args(1).asInstanceOf[java.util.List[VirtualMachineLease]]
          + f(requests.asScala, newLeases.asScala)
          + }
          + }
          + }
          +
          + /**
          + * The context fixture.
          + */
          + class Context {
          + val optimizer = taskScheduler()
          + val optimizerBuilder = taskSchedulerBuilder(optimizer)
          + val schedulerDriver = mock(classOf[SchedulerDriver])
          + val trace = Mockito.inOrder(schedulerDriver)
          + val fsm = TestFSMRef(new LaunchCoordinator(testActor, config, schedulerDriver, optimizerBuilder))
          +
          + val framework = randomFramework
          + val task1 = randomTask
          + val task2 = randomTask
          + val task3 = randomTask
          +
          + val slave1 = {
          + val slave = randomSlave
          + (slave._1, slave._2, randomOffer(framework, slave), randomOffer(framework, slave), randomOffer(framework, slave))
          + }
          +
          + val slave2 = {
          + val slave = randomSlave
          + (slave._1, slave._2, randomOffer(framework, slave), randomOffer(framework, slave), randomOffer(framework, slave))
          + }
          + }
          +
          + def inState = afterWord("in state")
          + def handle = afterWord("handle")
          +
          + def handlesAssignments(state: TaskState) = {
          + "Unassign" which {
          + s"stays in $state with updated optimizer state" in new Context {
          — End diff –

          Very nice testing, although I'm not much of a fan of the Scala `WordSpec`.

          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75301829

          — Diff: flink-mesos/src/main/scala/org/apache/flink/runtime/clusterframework/ContaineredJobManager.scala —
          @@ -0,0 +1,174 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.runtime.clusterframework
          +
          +import java.util.concurrent.{TimeUnit, ExecutorService}
          +
          +import akka.actor.ActorRef
          +
          +import org.apache.flink.api.common.JobID
          +import org.apache.flink.configuration.{Configuration => FlinkConfiguration, ConfigConstants}
          +import org.apache.flink.runtime.checkpoint.savepoint.SavepointStore
          +import org.apache.flink.runtime.checkpoint.CheckpointRecoveryFactory
          +import org.apache.flink.runtime.clusterframework.ApplicationStatus
          +import org.apache.flink.runtime.executiongraph.restart.RestartStrategyFactory
          +import org.apache.flink.runtime.clusterframework.messages._
          +import org.apache.flink.runtime.jobgraph.JobStatus
          +import org.apache.flink.runtime.jobmanager.{SubmittedJobGraphStore, JobManager}
          +import org.apache.flink.runtime.leaderelection.LeaderElectionService
          +import org.apache.flink.runtime.messages.JobManagerMessages.{RequestJobStatus, CurrentJobStatus, JobNotFound}
          +import org.apache.flink.runtime.messages.Messages.Acknowledge
          +import org.apache.flink.runtime.metrics.{MetricRegistry => FlinkMetricRegistry}
          +import org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager
          +import org.apache.flink.runtime.instance.InstanceManager
          +import org.apache.flink.runtime.jobmanager.scheduler.{Scheduler => FlinkScheduler}
          +
          +import scala.concurrent.duration._
          +import scala.language.postfixOps
          +
          +
          +/** JobManager actor for execution on Yarn or Mesos. It enriches the [[JobManager]] with additional messages
          + * to start/administer/stop the session.
          + *
          + * @param flinkConfiguration Configuration object for the actor
          + * @param executorService Execution context which is used to execute concurrent tasks in the
          + * [[org.apache.flink.runtime.executiongraph.ExecutionGraph]]
          + * @param instanceManager Instance manager to manage the registered
          + * [[org.apache.flink.runtime.taskmanager.TaskManager]]
          + * @param scheduler Scheduler to schedule Flink jobs
          + * @param libraryCacheManager Manager to manage uploaded jar files
          + * @param archive Archive for finished Flink jobs
          + * @param restartStrategyFactory Restart strategy to be used in case of a job recovery
          + * @param timeout Timeout for futures
          + * @param leaderElectionService LeaderElectionService to participate in the leader election
          + */
          +abstract class ContaineredJobManager(
          + flinkConfiguration: FlinkConfiguration,
          + executorService: ExecutorService,
          + instanceManager: InstanceManager,
          + scheduler: FlinkScheduler,
          + libraryCacheManager: BlobLibraryCacheManager,
          + archive: ActorRef,
          + restartStrategyFactory: RestartStrategyFactory,
          + timeout: FiniteDuration,
          + leaderElectionService: LeaderElectionService,
          + submittedJobGraphs : SubmittedJobGraphStore,
          + checkpointRecoveryFactory : CheckpointRecoveryFactory,
          + savepointStore: SavepointStore,
          + jobRecoveryTimeout: FiniteDuration,
          + metricsRegistry: Option[FlinkMetricRegistry])
          + extends JobManager(
          + flinkConfiguration,
          + executorService,
          + instanceManager,
          + scheduler,
          + libraryCacheManager,
          + archive,
          + restartStrategyFactory,
          + timeout,
          + leaderElectionService,
          + submittedJobGraphs,
          + checkpointRecoveryFactory,
          + savepointStore,
          + jobRecoveryTimeout,
          + metricsRegistry) {
          +
          + val jobPollingInterval: FiniteDuration
          +
          + // indicates if this JM has been started in a dedicated (per-job) mode.
          + var stopWhenJobFinished: JobID = null
          +
          + override def handleMessage: Receive = {
          + handleContainerMessage orElse super.handleMessage
          + }
          +
          + def handleContainerMessage: Receive = {
          +
          + case msg @ (_: RegisterInfoMessageListener | _: UnRegisterInfoMessageListener) =>
          + // forward to ResourceManager
          + currentResourceManager match {
          + case Some(rm) =>
          + // we forward the message
          + rm.forward(decorateMessage(msg))
          + case None =>
          + // client has to try again
          + }
          +
          + case msg: ShutdownClusterAfterJob =>
          + val jobId = msg.jobId()
          + log.info(s"ApplicationMaster will shut down session when job $jobId has finished.")
          + stopWhenJobFinished = jobId
          + // trigger regular job status messages (if this is a dedicated/per-job cluster)
          + if (stopWhenJobFinished != null) {
          + context.system.scheduler.schedule(0 seconds,
          — End diff –

          The polling is a leftover from the old Yarn code. Indeed, it would be nicer to apply a hook immediately upon job removal.

          +1 for making `ContaineredJobManager` the base for the Yarn and Mesos JobManager.

          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75301383

          — Diff: flink-mesos/src/main/scala/org/apache/flink/mesos/scheduler/Tasks.scala —
          @@ -0,0 +1,114 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.scheduler
          +
          +import akka.actor.{Actor, ActorRef, Props}
          +import org.apache.flink.configuration.Configuration
          +import org.apache.flink.mesos.scheduler.ReconciliationCoordinator.Reconcile
          +import org.apache.flink.mesos.scheduler.TaskMonitor.{TaskGoalState, TaskGoalStateUpdated, TaskTerminated}
          +import org.apache.flink.mesos.scheduler.Tasks._
          +import org.apache.flink.mesos.scheduler.messages._
          +import org.apache.mesos.{SchedulerDriver, Protos}
          +
          +import scala.collection.mutable.{Map => MutableMap}
          +
          +/**
          + * Aggregate of monitored tasks.
          + *
          + * Routes messages between the scheduler and individual task monitor actors.
          + */
          +class Tasks[M <: TaskMonitor](
          + flinkConfig: Configuration,
          + schedulerDriver: SchedulerDriver,
          + taskMonitorClass: Class[M]) extends Actor {
          +
          + /**
          + * A map of task monitors by task ID.
          + */
          + private val taskMap: MutableMap[Protos.TaskID,ActorRef] = MutableMap()
          +
          + /**
          + * Cache of current connection state.
          + */
          + private var registered: Option[Any] = None
          +
          + override def preStart(): Unit = {
          + // TODO subscribe to context.system.deadLetters for messages to nonexistent tasks
          + }
          +
          + override def receive: Receive = {
          +
          + case msg: Disconnected =>
          + registered = None
          + context.actorSelection("*").tell(msg, self)
          +
          + case msg : Connected =>
          + registered = Some(msg)
          + context.actorSelection("*").tell(msg, self)
          +
          + case msg: TaskGoalStateUpdated =>
          + val taskID = msg.state.taskID
          +
          + // ensure task monitor exists
          + if(!taskMap.contains(taskID)) {
          + val actorRef = createTask(msg.state)
          + registered.foreach(actorRef ! _)
          + }
          +
          + taskMap(taskID) ! msg
          +
          + case msg: StatusUpdate =>
          + taskMap(msg.status().getTaskId) ! msg
          +
          + case msg: Reconcile =>
          + context.parent.forward(msg)
          — End diff –

          Same as above. The parent is the resource manager. Do we want to make this explicit?

          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75300879

          — Diff: flink-mesos/src/main/scala/org/apache/flink/mesos/scheduler/Tasks.scala —
          @@ -0,0 +1,114 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.scheduler
          +
          +import akka.actor.{Actor, ActorRef, Props}
          +import org.apache.flink.configuration.Configuration
          +import org.apache.flink.mesos.scheduler.ReconciliationCoordinator.Reconcile
          +import org.apache.flink.mesos.scheduler.TaskMonitor.{TaskGoalState, TaskGoalStateUpdated, TaskTerminated}
          +import org.apache.flink.mesos.scheduler.Tasks._
          +import org.apache.flink.mesos.scheduler.messages._
          +import org.apache.mesos.{SchedulerDriver, Protos}
          +
          +import scala.collection.mutable.{Map => MutableMap}
          +
          +/**
          + * Aggregate of monitored tasks.
          + *
          + * Routes messages between the scheduler and individual task monitor actors.
          + */
          +class Tasks[M <: TaskMonitor](
          + flinkConfig: Configuration,
          + schedulerDriver: SchedulerDriver,
          + taskMonitorClass: Class[M]) extends Actor {
          +
          + /**
          + * A map of task monitors by task ID.
          + */
          + private val taskMap: MutableMap[Protos.TaskID,ActorRef] = MutableMap()
          — End diff –

          Add a space after `Protos.TaskID,`.

          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75300491

          — Diff: flink-mesos/src/main/scala/org/apache/flink/mesos/scheduler/TaskMonitor.scala —
          @@ -0,0 +1,258 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.scheduler
          +
          +import grizzled.slf4j.Logger
          +
          +import akka.actor.{Actor, FSM, Props}
          +import org.apache.flink.configuration.Configuration
          +import org.apache.flink.mesos.scheduler.ReconciliationCoordinator.Reconcile
          +import org.apache.flink.mesos.scheduler.TaskMonitor._
          +import org.apache.flink.mesos.scheduler.messages.{Connected, Disconnected, StatusUpdate}
          +import org.apache.mesos.Protos.TaskState._
          +import org.apache.mesos.{SchedulerDriver, Protos}
          +
          +import scala.PartialFunction.empty
          +import scala.concurrent.duration._
          +
          +/**
          + * Monitors a Mesos task throughout its lifecycle.
          + *
          + * Models a task with a state machine reflecting the perceived state of the task in Mesos. The state
          + * is primarily updated when task status information arrives from Mesos.
          + *
          + * The associated state data primarily tracks the task's goal (intended) state, as persisted by the scheduler.
          + * Keep in mind that goal state is persisted before actions are taken. The goal state strictly transitions
          + * thru New->Launched->Released.
          + *
          + * Unlike most exchanges with Mesos, task status is delivered at-least-once, so status handling should be idempotent.
          + */
          +class TaskMonitor(
          + flinkConfig: Configuration,
          + schedulerDriver: SchedulerDriver,
          + goalState: TaskGoalState) extends Actor with FSM[TaskMonitorState,StateData] {
          +
          + val LOG = Logger(getClass)
          +
          + startWith(Suspended, StateData(goalState))
          +
          + // ------------------------------------------------------------------------
          + // Suspended State
          + // ------------------------------------------------------------------------
          +
          + when(Suspended) {
          + case Event(update: TaskGoalStateUpdated, _) =>
          + stay() using StateData(update.state)
          + case Event(msg: StatusUpdate, _) =>
          + stay()
          + case Event(msg: Connected, StateData(goal: New)) =>
          + goto(New)
          + case Event(msg: Connected, StateData(goal: Launched)) =>
          + goto(Reconciling)
          + case Event(msg: Connected, StateData(goal: Released)) =>
          + goto(Killing)
          + }
          +
          + // ------------------------------------------------------------------------
          + // New State
          + // ------------------------------------------------------------------------
          +
          + when(New) {
          + case Event(TaskGoalStateUpdated(goal: Launched), _) =>
          + goto(Staging) using StateData(goal)
          + }
          +
          + // ------------------------------------------------------------------------
          + // Reconciliation State
          + // ------------------------------------------------------------------------
          +
          + onTransition {
          + case _ -> Reconciling =>
          + nextStateData.goal match {
          + case goal: Launched =>
          + val taskStatus = Protos.TaskStatus.newBuilder()
          + .setTaskId(goal.taskID).setSlaveId(goal.slaveID).setState(TASK_STAGING).build()
          + context.parent ! Reconcile(Seq(taskStatus))
          — End diff –

          Would it be cleaner to pass the `ActorRef` directly to the TaskMonitor?

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75299022

          — Diff: flink-mesos/src/main/scala/org/apache/flink/mesos/scheduler/LaunchCoordinator.scala —
          @@ -0,0 +1,349 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.scheduler
          +
          +import akka.actor.{Actor, ActorRef, FSM, Props}
          +import com.netflix.fenzo._
          +import com.netflix.fenzo.functions.Action1
          +import com.netflix.fenzo.plugins.VMLeaseObject
          +import grizzled.slf4j.Logger
          +import org.apache.flink.api.java.tuple.{Tuple2=>FlinkTuple2}
          +import org.apache.flink.configuration.Configuration
          +import org.apache.flink.mesos.scheduler.LaunchCoordinator._
          +import org.apache.flink.mesos.scheduler.messages._
          +import org.apache.mesos.Protos.TaskInfo
          — End diff –

          Unused import

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75298678

          — Diff: flink-mesos/src/main/scala/org/apache/flink/mesos/scheduler/LaunchCoordinator.scala —
          @@ -0,0 +1,349 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.scheduler
          +
          +import akka.actor.{Actor, ActorRef, FSM, Props}
          +import com.netflix.fenzo._
          +import com.netflix.fenzo.functions.Action1
          +import com.netflix.fenzo.plugins.VMLeaseObject
          +import grizzled.slf4j.Logger
          +import org.apache.flink.api.java.tuple.{Tuple2=>FlinkTuple2}
          +import org.apache.flink.configuration.Configuration
          +import org.apache.flink.mesos.scheduler.LaunchCoordinator._
          +import org.apache.flink.mesos.scheduler.messages._
          +import org.apache.mesos.Protos.TaskInfo
          +import org.apache.mesos.{SchedulerDriver, Protos}
          +
          +import scala.collection.JavaConverters._
          +import scala.collection.mutable.{Map => MutableMap}
          +import scala.concurrent.duration._
          +
          +/**
          + * The launch coordinator handles offer processing, including
          + * matching offers to tasks and making reservations.
          + *
          + * The coordinator uses Netflix Fenzo to optimize task placement. During the GatheringOffers phase,
          — End diff –

          Fenzo also has my endorsement. It makes sense to delegate scheduling logic to a dedicated library.

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75297945

          — Diff: flink-mesos/src/main/scala/org/apache/flink/mesos/scheduler/ConnectionMonitor.scala —
          @@ -0,0 +1,126 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.scheduler
          +
          +import akka.actor.{Actor, FSM, Props}
          +import grizzled.slf4j.Logger
          +import org.apache.flink.configuration.Configuration
          +import org.apache.flink.mesos.scheduler.ConnectionMonitor._
          +import org.apache.flink.mesos.scheduler.messages._
          +
          +import scala.concurrent.duration._
          +
          +/**
          + * Actively monitors the Mesos connection.
          + */
          +class ConnectionMonitor() extends Actor with FSM[FsmState, Unit] {
          +
          + val LOG = Logger(getClass)
          +
          + startWith(StoppedState, None)
          +
          + when(StoppedState) {
          + case Event(msg: Start, _) =>
          + LOG.info(s"Connecting to Mesos...")
          + goto(ConnectingState)
          + }
          +
          + when(ConnectingState, stateTimeout = CONNECT_RETRY_RATE) {
          + case Event(msg: Stop, _) =>
          + goto(StoppedState)
          +
          + case Event(msg: Registered, _) =>
          + LOG.info(s"Connected to Mesos as framework ID ${msg.frameworkId.getValue}.")
          + LOG.debug(s" Master Info: ${msg.masterInfo}")
          + goto(ConnectedState)
          +
          + case Event(msg: ReRegistered, _) =>
          + LOG.info("Reconnected to a new Mesos master.")
          + LOG.debug(s" Master Info: ${msg.masterInfo}")
          + goto(ConnectedState)
          +
          + case Event(StateTimeout, _) =>
          + LOG.warn("Unable to connect to Mesos; still trying...")
          + stay()
          + }
          +
          + when(ConnectedState) {
          + case Event(msg: Stop, _) =>
          + goto(StoppedState)
          +
          + case Event(msg: Disconnected, _) =>
          + LOG.warn("Disconnected from the Mesos master. Reconnecting...")
          + goto(ConnectingState)
          + }
          +
          — End diff –

          Would it make sense to add a `whenUnhandled {...}` handler here?
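
To make the question concrete, here is a minimal, library-free sketch of what a catch-all handler does. It is not the PR's code and does not use Akka: the object name `UnhandledFallbackSketch`, the simplified message set, and the `warn` callback are assumptions made for illustration only. Each state has a partial handler, and any message that handler does not match falls through to a fallback that logs and stays in the current state, which is roughly the role `whenUnhandled` plays in an Akka FSM.

```scala
// Illustrative model only: names are hypothetical, not taken from Flink or Akka.
object UnhandledFallbackSketch {
  sealed trait State
  case object Stopped extends State
  case object Connecting extends State
  case object Connected extends State

  sealed trait Message
  case object Start extends Message
  case object Stop extends Message
  case object Registered extends Message
  case object Disconnected extends Message

  type Handler = PartialFunction[Message, State]

  // One partial handler per state, loosely mirroring the when(...) blocks in the diff.
  val whenStopped: Handler = { case Start => Connecting }
  val whenConnecting: Handler = {
    case Stop       => Stopped
    case Registered => Connected
  }
  val whenConnected: Handler = {
    case Stop         => Stopped
    case Disconnected => Connecting
  }

  val handlers: Map[State, Handler] = Map(
    Stopped    -> whenStopped,
    Connecting -> whenConnecting,
    Connected  -> whenConnected
  )

  // Try the current state's handler first; if it does not match,
  // the fallback reports the message and keeps the state unchanged.
  def step(state: State, msg: Message, warn: String => Unit): State =
    handlers(state).applyOrElse(msg, (m: Message) => {
      warn(s"Unhandled message $m in state $state")
      state
    })
}
```

With this shape, a `Disconnected` message arriving while stopped produces a warning and leaves the state as `Stopped`, rather than relying on whatever the default treatment of unmatched messages happens to be.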

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75296757

          — Diff: flink-mesos/src/main/scala/org/apache/flink/runtime/clusterframework/ContaineredJobManager.scala —
          @@ -0,0 +1,174 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.runtime.clusterframework
          +
          +import java.util.concurrent.{TimeUnit, ExecutorService}
          +
          +import akka.actor.ActorRef
          +
          +import org.apache.flink.api.common.JobID
          +import org.apache.flink.configuration.{Configuration => FlinkConfiguration, ConfigConstants}
          +import org.apache.flink.runtime.checkpoint.savepoint.SavepointStore
          +import org.apache.flink.runtime.checkpoint.CheckpointRecoveryFactory
          +import org.apache.flink.runtime.clusterframework.ApplicationStatus
          +import org.apache.flink.runtime.executiongraph.restart.RestartStrategyFactory
          +import org.apache.flink.runtime.clusterframework.messages._
          +import org.apache.flink.runtime.jobgraph.JobStatus
          +import org.apache.flink.runtime.jobmanager.{SubmittedJobGraphStore, JobManager}
          +import org.apache.flink.runtime.leaderelection.LeaderElectionService
          +import org.apache.flink.runtime.messages.JobManagerMessages.{RequestJobStatus, CurrentJobStatus, JobNotFound}
          +import org.apache.flink.runtime.messages.Messages.Acknowledge
          +import org.apache.flink.runtime.metrics.{MetricRegistry => FlinkMetricRegistry}
          +import org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager
          +import org.apache.flink.runtime.instance.InstanceManager
          +import org.apache.flink.runtime.jobmanager.scheduler.{Scheduler => FlinkScheduler}
          +
          +import scala.concurrent.duration._
          +import scala.language.postfixOps
          +
          +
          +/** JobManager actor for execution on Yarn or Mesos. It enriches the [[JobManager]] with additional messages
          + * to start/administer/stop the session.
          — End diff –

          Good idea, but this has yet to be integrated into the flink-yarn module.

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75296348

          — Diff: flink-mesos/src/main/resources/log4j.properties —
          @@ -0,0 +1,27 @@
          +################################################################################
          +# Licensed to the Apache Software Foundation (ASF) under one
          +# or more contributor license agreements. See the NOTICE file
          +# distributed with this work for additional information
          +# regarding copyright ownership. The ASF licenses this file
          +# to you under the Apache License, Version 2.0 (the
          +# "License"); you may not use this file except in compliance
          +# with the License. You may obtain a copy of the License at
          +#
          +# http://www.apache.org/licenses/LICENSE-2.0
          +#
          +# Unless required by applicable law or agreed to in writing, software
          +# distributed under the License is distributed on an "AS IS" BASIS,
          +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          +# See the License for the specific language governing permissions and
          +# limitations under the License.
          +################################################################################
          +
          +
          +# Convenience file for local debugging of the JobManager/TaskManager.
          +log4j.rootLogger=INFO, console
          +log4j.appender.console=org.apache.log4j.ConsoleAppender
          +log4j.appender.console.layout=org.apache.log4j.PatternLayout
          +log4j.appender.console.layout.ConversionPattern=%d{HH:mm:ss,SSS} %-5p %-60c %x - %m%n
          +
          +log4j.logger.org.apache.flink.mesos=DEBUG
          +log4j.logger.org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager=INFO
          — End diff –

          Do we want to uncomment these two rules and keep the INFO default?

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75295536

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/util/MesosArtifactServer.java —
          @@ -0,0 +1,304 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.util;
          +
          +import io.netty.bootstrap.ServerBootstrap;
          +import io.netty.buffer.Unpooled;
          +import io.netty.channel.Channel;
          +import io.netty.channel.ChannelFuture;
          +import io.netty.channel.ChannelFutureListener;
          +import io.netty.channel.ChannelHandler;
          +import io.netty.channel.ChannelHandlerContext;
          +import io.netty.channel.ChannelInitializer;
          +import io.netty.channel.DefaultFileRegion;
          +import io.netty.channel.SimpleChannelInboundHandler;
          +import io.netty.channel.nio.NioEventLoopGroup;
          +import io.netty.channel.socket.SocketChannel;
          +import io.netty.channel.socket.nio.NioServerSocketChannel;
          +import io.netty.handler.codec.http.DefaultFullHttpResponse;
          +import io.netty.handler.codec.http.DefaultHttpResponse;
          +import io.netty.handler.codec.http.FullHttpResponse;
          +import io.netty.handler.codec.http.HttpHeaders;
          +import io.netty.handler.codec.http.HttpRequest;
          +import io.netty.handler.codec.http.HttpResponse;
          +import io.netty.handler.codec.http.HttpResponseStatus;
          +import io.netty.handler.codec.http.HttpServerCodec;
          +import io.netty.handler.codec.http.LastHttpContent;
          +import io.netty.handler.codec.http.router.Handler;
          +import io.netty.handler.codec.http.router.Routed;
          +import io.netty.handler.codec.http.router.Router;
          +import io.netty.util.CharsetUtil;
          +import org.jets3t.service.utils.Mimetypes;
          +import org.slf4j.Logger;
          +import org.slf4j.LoggerFactory;
          +
          +import java.io.File;
          +import java.io.FileNotFoundException;
          +import java.io.RandomAccessFile;
          +import java.net.InetSocketAddress;
          +import java.net.MalformedURLException;
          +import java.net.URL;
          +
          +import static io.netty.handler.codec.http.HttpHeaders.Names.CACHE_CONTROL;
          +import static io.netty.handler.codec.http.HttpHeaders.Names.CONNECTION;
          +import static io.netty.handler.codec.http.HttpHeaders.Names.CONTENT_TYPE;
          +import static io.netty.handler.codec.http.HttpMethod.GET;
          +import static io.netty.handler.codec.http.HttpMethod.HEAD;
          +import static io.netty.handler.codec.http.HttpResponseStatus.GONE;
          +import static io.netty.handler.codec.http.HttpResponseStatus.INTERNAL_SERVER_ERROR;
          +import static io.netty.handler.codec.http.HttpResponseStatus.METHOD_NOT_ALLOWED;
          +import static io.netty.handler.codec.http.HttpResponseStatus.NOT_FOUND;
          +import static io.netty.handler.codec.http.HttpResponseStatus.OK;
          +import static io.netty.handler.codec.http.HttpVersion.HTTP_1_1;
          +
          +
          +/**
          + * A generic Mesos artifact server, designed specifically for use by the Mesos Fetcher.
          + *
          + * More information:
          + * http://mesos.apache.org/documentation/latest/fetcher/
          + * http://mesos.apache.org/documentation/latest/fetcher-cache-internals/
          + */
          +public class MesosArtifactServer {
          +
          + private static final Logger LOG = LoggerFactory.getLogger(MesosArtifactServer.class);
          +
          + private final Router router;
          +
          + private ServerBootstrap bootstrap;
          +
          + private Channel serverChannel;
          +
          + private URL baseURL;
          +
          + public MesosArtifactServer(String sessionID, String serverHostname, int configuredPort) throws Exception {
          + if (configuredPort < 0 || configuredPort > 0xFFFF) {
          + throw new IllegalArgumentException("File server port is invalid: " + configuredPort);
          + }
          +
          + router = new Router();
          +
          + ChannelInitializer<SocketChannel> initializer = new ChannelInitializer<SocketChannel>() {
          +
          + @Override
          + protected void initChannel(SocketChannel ch) {
          + Handler handler = new Handler(router);
          +
          + ch.pipeline()
          + .addLast(new HttpServerCodec())
          + .addLast(handler.name(), handler)
          + .addLast(new UnknownFileHandler());
          + }
          + };
          +
          + NioEventLoopGroup bossGroup = new NioEventLoopGroup(1);
          + NioEventLoopGroup workerGroup = new NioEventLoopGroup();
          +
          + this.bootstrap = new ServerBootstrap();
          + this.bootstrap
          + .group(bossGroup, workerGroup)
          + .channel(NioServerSocketChannel.class)
          + .childHandler(initializer);
          +
          + Channel ch = this.bootstrap.bind(serverHostname, configuredPort).sync().channel();
          + this.serverChannel = ch;
          +
          + InetSocketAddress bindAddress = (InetSocketAddress) ch.localAddress();
          + String address = bindAddress.getAddress().getHostAddress();
          + int port = bindAddress.getPort();
          +
          + baseURL = new URL("http", serverHostname, port, "/" + sessionID + "/");
          +
          + LOG.info("Mesos artifact server listening at " + address + ':' + port);
          + }
          +
          + /**
          + * Get the server port on which the artifact server is listening.
          + */
          + public synchronized int getServerPort() {
          + Channel server = this.serverChannel;
          + if (server != null) {
          + try {
          + return ((InetSocketAddress) server.localAddress()).getPort();
          + } catch (Exception e) {
          + LOG.error("Cannot access local server port", e);
          + }
          + }
          + return -1;
          + }
          +
          + /**
          + * Adds a file to the artifact server.
          + * @param localFile the local file to serve.
          + * @param remoteFile the remote path with which to locate the file.
          + * @return the fully-qualified remote path to the file.
          + * @throws MalformedURLException if the remote path is invalid.
          + */
          + public synchronized URL addFile(File localFile, String remoteFile) throws MalformedURLException {
          — End diff –

          Is `synchronized` really necessary? This should only be called by a single process, i.e. the Mesos master runner.
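To make the question above concrete, here is a minimal sketch of an alternative, assuming the only shared state touched by `addFile` is a path lookup. `PathRegistrySketch` and its fields are hypothetical names for illustration, not the PR's code; the sketch keeps the `new URL("http", host, port, "/" + sessionID + "/")` base URL from the diff and resolves the remote path against it.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical stand-in for the artifact server's path bookkeeping; not the PR's code. */
public class PathRegistrySketch {

    private final URL baseURL;

    // A concurrent map is safe for multiple callers, so addFile needs no synchronized keyword.
    private final ConcurrentMap<String, String> paths = new ConcurrentHashMap<>();

    public PathRegistrySketch(String host, int port, String sessionID) throws MalformedURLException {
        // Same base URL shape as in the diff: http://host:port/sessionID/
        this.baseURL = new URL("http", host, port, "/" + sessionID + "/");
    }

    /** Registers a local path under a remote name and returns the resolved remote URL. */
    public URL addFile(String localPath, String remoteFile) throws MalformedURLException {
        paths.put(remoteFile, localPath);
        // Resolve the relative remote path against the session base URL.
        return new URL(baseURL, remoteFile);
    }

    public int size() {
        return paths.size();
    }
}
```

If only one thread ever calls `addFile`, either approach works; the concurrent map simply avoids holding the object's monitor while a route is registered.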

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75295468

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/util/MesosArtifactServer.java —
          @@ -0,0 +1,304 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.util;
          +
          +import io.netty.bootstrap.ServerBootstrap;
          +import io.netty.buffer.Unpooled;
          +import io.netty.channel.Channel;
          +import io.netty.channel.ChannelFuture;
          +import io.netty.channel.ChannelFutureListener;
          +import io.netty.channel.ChannelHandler;
          +import io.netty.channel.ChannelHandlerContext;
          +import io.netty.channel.ChannelInitializer;
          +import io.netty.channel.DefaultFileRegion;
          +import io.netty.channel.SimpleChannelInboundHandler;
          +import io.netty.channel.nio.NioEventLoopGroup;
          +import io.netty.channel.socket.SocketChannel;
          +import io.netty.channel.socket.nio.NioServerSocketChannel;
          +import io.netty.handler.codec.http.DefaultFullHttpResponse;
          +import io.netty.handler.codec.http.DefaultHttpResponse;
          +import io.netty.handler.codec.http.FullHttpResponse;
          +import io.netty.handler.codec.http.HttpHeaders;
          +import io.netty.handler.codec.http.HttpRequest;
          +import io.netty.handler.codec.http.HttpResponse;
          +import io.netty.handler.codec.http.HttpResponseStatus;
          +import io.netty.handler.codec.http.HttpServerCodec;
          +import io.netty.handler.codec.http.LastHttpContent;
          +import io.netty.handler.codec.http.router.Handler;
          +import io.netty.handler.codec.http.router.Routed;
          +import io.netty.handler.codec.http.router.Router;
          +import io.netty.util.CharsetUtil;
          +import org.jets3t.service.utils.Mimetypes;
          +import org.slf4j.Logger;
          +import org.slf4j.LoggerFactory;
          +
          +import java.io.File;
          +import java.io.FileNotFoundException;
          +import java.io.RandomAccessFile;
          +import java.net.InetSocketAddress;
          +import java.net.MalformedURLException;
          +import java.net.URL;
          +
          +import static io.netty.handler.codec.http.HttpHeaders.Names.CACHE_CONTROL;
          +import static io.netty.handler.codec.http.HttpHeaders.Names.CONNECTION;
          +import static io.netty.handler.codec.http.HttpHeaders.Names.CONTENT_TYPE;
          +import static io.netty.handler.codec.http.HttpMethod.GET;
          +import static io.netty.handler.codec.http.HttpMethod.HEAD;
          +import static io.netty.handler.codec.http.HttpResponseStatus.GONE;
          +import static io.netty.handler.codec.http.HttpResponseStatus.INTERNAL_SERVER_ERROR;
          +import static io.netty.handler.codec.http.HttpResponseStatus.METHOD_NOT_ALLOWED;
          +import static io.netty.handler.codec.http.HttpResponseStatus.NOT_FOUND;
          +import static io.netty.handler.codec.http.HttpResponseStatus.OK;
          +import static io.netty.handler.codec.http.HttpVersion.HTTP_1_1;
          +
          +
          +/**
          + * A generic Mesos artifact server, designed specifically for use by the Mesos Fetcher.
          + *
          + * More information:
          + * http://mesos.apache.org/documentation/latest/fetcher/
          + * http://mesos.apache.org/documentation/latest/fetcher-cache-internals/
          + */
          +public class MesosArtifactServer {
          +
          + private static final Logger LOG = LoggerFactory.getLogger(MesosArtifactServer.class);
          +
          + private final Router router;
          +
          + private ServerBootstrap bootstrap;
          +
          + private Channel serverChannel;
          +
          + private URL baseURL;
          +
          + public MesosArtifactServer(String sessionID, String serverHostname, int configuredPort) throws Exception {
          + if (configuredPort < 0 || configuredPort > 0xFFFF) {
          + throw new IllegalArgumentException("File server port is invalid: " + configuredPort);
          + }
          +
          + router = new Router();
          +
          + ChannelInitializer<SocketChannel> initializer = new ChannelInitializer<SocketChannel>() {
          +
          + @Override
          + protected void initChannel(SocketChannel ch) {
          + Handler handler = new Handler(router);
          +
          + ch.pipeline()
          + .addLast(new HttpServerCodec())
          + .addLast(handler.name(), handler)
          + .addLast(new UnknownFileHandler());
          + }
          + };
          +
          + NioEventLoopGroup bossGroup = new NioEventLoopGroup(1);
          + NioEventLoopGroup workerGroup = new NioEventLoopGroup();
          +
          + this.bootstrap = new ServerBootstrap();
          + this.bootstrap
          + .group(bossGroup, workerGroup)
          + .channel(NioServerSocketChannel.class)
          + .childHandler(initializer);
          +
          + Channel ch = this.bootstrap.bind(serverHostname, configuredPort).sync().channel();
          + this.serverChannel = ch;
          +
          + InetSocketAddress bindAddress = (InetSocketAddress) ch.localAddress();
          + String address = bindAddress.getAddress().getHostAddress();
          + int port = bindAddress.getPort();
          +
          + baseURL = new URL("http", serverHostname, port, "/" + sessionID + "/");
          +
          + LOG.info("Mesos artifact server listening at " + address + ':' + port);
          + }
          +
          + /**
          + * Get the server port on which the artifact server is listening.
          + */
          + public synchronized int getServerPort() {
          — End diff –

          Unused method, but we'll probably need it at a later point in time.

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75285727

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/scheduler/TaskSchedulerBuilder.java —
          @@ -0,0 +1,34 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.scheduler;
          +
          +import com.netflix.fenzo.TaskScheduler;
          +import com.netflix.fenzo.VirtualMachineLease;
          +import com.netflix.fenzo.functions.Action1;
          +
          +/**
          + * A builder for the Fenzo task scheduler.
          + *
          + * Note that the Fenzo-provided {@link TaskScheduler.Builder} cannot be mocked, which motivates this interface.
          + */
          +public interface TaskSchedulerBuilder {
          + TaskSchedulerBuilder withLeaseRejectAction(Action1<VirtualMachineLease> action);
          — End diff –

          A new line would be nice.
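The Javadoc in the diff above motivates `TaskSchedulerBuilder` by noting that the Fenzo builder cannot be mocked. A minimal sketch of that pattern follows, using only hypothetical stand-in types (`VendorBuilder`, `SchedulerBuilder`, `RecordingBuilder`) rather than the real Fenzo or Flink classes. The vendor class is assumed to be `final` purely for illustration; the source does not say why the Fenzo builder resists mocking.

```java
import java.util.function.Consumer;

public class BuilderSeamSketch {

    /** Hypothetical third-party builder, assumed final here so it cannot be subclassed or mocked. */
    static final class VendorBuilder {
        private Consumer<String> rejectAction = lease -> { };

        VendorBuilder withLeaseRejectAction(Consumer<String> action) {
            this.rejectAction = action;
            return this;
        }

        String build() {
            return "vendor-scheduler";
        }
    }

    /** Project-owned interface: production and test code depend on this, not on the vendor class. */
    interface SchedulerBuilder {
        SchedulerBuilder withLeaseRejectAction(Consumer<String> action);

        String build();
    }

    /** Production implementation that delegates every call to the vendor builder. */
    static final class VendorBackedBuilder implements SchedulerBuilder {
        private final VendorBuilder delegate = new VendorBuilder();

        @Override
        public SchedulerBuilder withLeaseRejectAction(Consumer<String> action) {
            delegate.withLeaseRejectAction(action);
            return this;
        }

        @Override
        public String build() {
            return delegate.build();
        }
    }

    /** Test double that records whether the reject action was configured. */
    static final class RecordingBuilder implements SchedulerBuilder {
        boolean rejectActionSet;

        @Override
        public SchedulerBuilder withLeaseRejectAction(Consumer<String> action) {
            rejectActionSet = true;
            return this;
        }

        @Override
        public String build() {
            return "fake-scheduler";
        }
    }
}
```

A test can pass a `RecordingBuilder` wherever a `SchedulerBuilder` is expected and then check `rejectActionSet`, without touching the vendor class at all.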

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75285623

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/scheduler/SchedulerProxy.java —
          @@ -0,0 +1,105 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.scheduler;
          +
          +import akka.actor.ActorRef;
          +
          +import org.apache.flink.mesos.scheduler.messages.Disconnected;
          +import org.apache.flink.mesos.scheduler.messages.Error;
          +import org.apache.flink.mesos.scheduler.messages.Error;
          +import org.apache.flink.mesos.scheduler.messages.OfferRescinded;
          +import org.apache.flink.mesos.scheduler.messages.ReRegistered;
          +import org.apache.flink.mesos.scheduler.messages.Registered;
          +import org.apache.flink.mesos.scheduler.messages.ResourceOffers;
          +import org.apache.flink.mesos.scheduler.messages.SlaveLost;
          +import org.apache.flink.mesos.scheduler.messages.StatusUpdate;
          +import org.apache.mesos.Protos;
          +import org.apache.mesos.Scheduler;
          +import org.apache.mesos.SchedulerDriver;
          +
          +import java.util.List;
          +
          +/**
          + * This class reacts to callbacks from the Mesos scheduler driver.
          + *
          + * In order to preserve actor concurrency safety, this class simply sends
          + * corresponding messages to the Mesos resource master actor.
          + *
          + * See https://mesos.apache.org/api/latest/java/org/apache/mesos/Scheduler.html
          + */
          +public class SchedulerProxy implements Scheduler {
          +
          + /** The actor to which we report the callbacks */
          + private ActorRef mesosActor;
          +
          + public SchedulerProxy(ActorRef mesosActor) {
          + this.mesosActor = mesosActor;
          + }
          +
          + @Override
          + public void registered(SchedulerDriver driver, Protos.FrameworkID frameworkId, Protos.MasterInfo masterInfo) {
          + mesosActor.tell(new Registered(frameworkId, masterInfo), ActorRef.noSender());
          + }
          +
          + @Override
          + public void reregistered(SchedulerDriver driver, Protos.MasterInfo masterInfo) {
          + mesosActor.tell(new ReRegistered(masterInfo), ActorRef.noSender());
          + }
          +
          + @Override
          + public void disconnected(SchedulerDriver driver) {
          + mesosActor.tell(new Disconnected(), ActorRef.noSender());
          + }
          +
          +
          + @Override
          + public void resourceOffers(SchedulerDriver driver, List<Protos.Offer> offers) {
          + mesosActor.tell(new ResourceOffers(offers), ActorRef.noSender());
          + }
          +
          + @Override
          + public void offerRescinded(SchedulerDriver driver, Protos.OfferID offerId) {
          + mesosActor.tell(new OfferRescinded(offerId), ActorRef.noSender());
          + }
          +
          + @Override
          + public void statusUpdate(SchedulerDriver driver, Protos.TaskStatus status) {
          + mesosActor.tell(new StatusUpdate(status), ActorRef.noSender());
          + }
          +
          + @Override
          + public void frameworkMessage(SchedulerDriver driver, Protos.ExecutorID executorId, Protos.SlaveID slaveId, byte[] data) {
          + throw new UnsupportedOperationException("frameworkMessage is unexpected");
          — End diff –

          What other messages could the framework send? Is it worth crashing the actor?
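One alternative the question above points toward is to log and ignore the message rather than throw from the callback. The following is a minimal sketch under that assumption; `LenientCallbackSketch` and its method signature are hypothetical and only mirror the shape of `frameworkMessage`, they are not the Mesos `Scheduler` API or the PR's code.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Logger;

/** Hypothetical callback holder; mirrors the shape of frameworkMessage but is not the Mesos API. */
public class LenientCallbackSketch {

    private static final Logger LOG = Logger.getLogger(LenientCallbackSketch.class.getName());

    // Counts ignored messages so that unexpected traffic stays observable.
    private final AtomicInteger ignored = new AtomicInteger();

    /** Logs the unexpected message and returns normally instead of throwing. */
    public void frameworkMessage(String executorId, String slaveId, byte[] data) {
        int length = (data == null) ? 0 : data.length;
        LOG.warning("Ignoring unexpected framework message from executor " + executorId
                + " on " + slaveId + " (" + length + " bytes)");
        ignored.incrementAndGet();
    }

    public int ignoredCount() {
        return ignored.get();
    }
}
```

Whether ignoring is preferable to failing fast depends on whether an unexpected message indicates a harmless executor quirk or a real protocol violation, which the thread leaves open.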

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75285550

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/scheduler/SchedulerProxy.java —
          @@ -0,0 +1,105 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.scheduler;
          +
          +import akka.actor.ActorRef;
          +
          +import org.apache.flink.mesos.scheduler.messages.Disconnected;
          +import org.apache.flink.mesos.scheduler.messages.Error;
          +import org.apache.flink.mesos.scheduler.messages.Error;
          +import org.apache.flink.mesos.scheduler.messages.OfferRescinded;
          +import org.apache.flink.mesos.scheduler.messages.ReRegistered;
          +import org.apache.flink.mesos.scheduler.messages.Registered;
          +import org.apache.flink.mesos.scheduler.messages.ResourceOffers;
          +import org.apache.flink.mesos.scheduler.messages.SlaveLost;
          +import org.apache.flink.mesos.scheduler.messages.StatusUpdate;
          +import org.apache.mesos.Protos;
          +import org.apache.mesos.Scheduler;
          +import org.apache.mesos.SchedulerDriver;
          +
          +import java.util.List;
          +
          +/**
          + * This class reacts to callbacks from the Mesos scheduler driver.
          + *
          + * In order to preserve actor concurrency safety, this class simply sends
          + * corresponding messages to the Mesos resource master actor.
          + *
          + * See https://mesos.apache.org/api/latest/java/org/apache/mesos/Scheduler.html
          + */
          +public class SchedulerProxy implements Scheduler {
          +
          + /** The actor to which we report the callbacks */
          + private ActorRef mesosActor;
          +
          + public SchedulerProxy(ActorRef mesosActor) {
          + this.mesosActor = mesosActor;
          + }
          +
          + @Override
          + public void registered(SchedulerDriver driver, Protos.FrameworkID frameworkId, Protos.MasterInfo masterInfo) {
          + mesosActor.tell(new Registered(frameworkId, masterInfo), ActorRef.noSender());
          + }
          +
          + @Override
          + public void reregistered(SchedulerDriver driver, Protos.MasterInfo masterInfo) {
          + mesosActor.tell(new ReRegistered(masterInfo), ActorRef.noSender());
          + }
          +
          + @Override
          + public void disconnected(SchedulerDriver driver) {
          + mesosActor.tell(new Disconnected(), ActorRef.noSender());
          + }
          +
          +
          + @Override
          + public void resourceOffers(SchedulerDriver driver, List<Protos.Offer> offers) {
          + mesosActor.tell(new ResourceOffers(offers), ActorRef.noSender());
          + }
          +
          + @Override
          + public void offerRescinded(SchedulerDriver driver, Protos.OfferID offerId) {
          + mesosActor.tell(new OfferRescinded(offerId), ActorRef.noSender());
          + }
          +
          + @Override
          + public void statusUpdate(SchedulerDriver driver, Protos.TaskStatus status) {
          + mesosActor.tell(new StatusUpdate(status), ActorRef.noSender());
          + }
          +
          + @Override
          + public void frameworkMessage(SchedulerDriver driver, Protos.ExecutorID executorId, Protos.SlaveID slaveId, byte[] data) {
          + throw new UnsupportedOperationException("frameworkMessage is unexpected");
          + }
          +
          + @Override
          + public void slaveLost(SchedulerDriver driver, Protos.SlaveID slaveId) {
          + mesosActor.tell(new SlaveLost(slaveId), ActorRef.noSender());
          + }
          +
          + @Override
          + public void executorLost(SchedulerDriver driver, Protos.ExecutorID executorId, Protos.SlaveID slaveId, int status) {
          + throw new UnsupportedOperationException("executorLost is unexpected");
          — End diff –

          Why don't we forward this message and crash the actor instead?
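
A minimal sketch of the forwarding alternative raised here, under stated assumptions: `ExecutorLost` is a hypothetical message class that does not appear in the diff, IDs are plain strings instead of `Protos` types, and a `BlockingQueue` stands in for the actor mailbox so the example runs without Akka or Mesos on the classpath. The proxy only enqueues the callback; the receiver decides whether it is fatal.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical message carrying the arguments of the executorLost callback. */
final class ExecutorLost {
    final String executorId;
    final String slaveId;
    final int status;

    ExecutorLost(String executorId, String slaveId, int status) {
        this.executorId = executorId;
        this.slaveId = slaveId;
        this.status = status;
    }
}

/** Stand-in for SchedulerProxy: forwards the callback instead of throwing on the driver thread. */
final class ForwardingProxySketch {
    private final BlockingQueue<Object> mailbox;

    ForwardingProxySketch(BlockingQueue<Object> mailbox) {
        this.mailbox = mailbox;
    }

    void executorLost(String executorId, String slaveId, int status) {
        // Enqueue only; no decision about failure is made here.
        mailbox.add(new ExecutorLost(executorId, slaveId, status));
    }
}

/** Stand-in for the receiving actor: it owns the decision to fail. */
final class ReceiverSketch {
    void handle(Object message) {
        if (message instanceof ExecutorLost) {
            ExecutorLost lost = (ExecutorLost) message;
            throw new IllegalStateException("executorLost is unexpected: " + lost.executorId);
        }
    }
}

final class ForwardingDemo {
    public static void main(String[] args) {
        BlockingQueue<Object> mailbox = new LinkedBlockingQueue<>();
        new ForwardingProxySketch(mailbox).executorLost("exec-1", "slave-1", 1);
        try {
            new ReceiverSketch().handle(mailbox.poll());
        } catch (IllegalStateException e) {
            System.out.println("receiver failed: " + e.getMessage());
        }
    }
}
```

With this shape the failure surfaces inside the receiver, where the actor's own failure handling applies, rather than on the scheduler driver's callback thread.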

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75284993

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/scheduler/SchedulerProxy.java —
          @@ -0,0 +1,105 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.scheduler;
          +
          +import akka.actor.ActorRef;
          +
          +import org.apache.flink.mesos.scheduler.messages.Disconnected;
          +import org.apache.flink.mesos.scheduler.messages.Error;
          +import org.apache.flink.mesos.scheduler.messages.Error;
          +import org.apache.flink.mesos.scheduler.messages.OfferRescinded;
          +import org.apache.flink.mesos.scheduler.messages.ReRegistered;
          +import org.apache.flink.mesos.scheduler.messages.Registered;
          +import org.apache.flink.mesos.scheduler.messages.ResourceOffers;
          +import org.apache.flink.mesos.scheduler.messages.SlaveLost;
          +import org.apache.flink.mesos.scheduler.messages.StatusUpdate;
          +import org.apache.mesos.Protos;
          +import org.apache.mesos.Scheduler;
          +import org.apache.mesos.SchedulerDriver;
          +
          +import java.util.List;
          +
          +/**
          + * This class reacts to callbacks from the Mesos scheduler driver.
          + *
          + * In order to preserve actor concurrency safety, this class simply sends
          + * corresponding messages to the Mesos resource master actor.
          + *
          + * See https://mesos.apache.org/api/latest/java/org/apache/mesos/Scheduler.html
          + */
          +public class SchedulerProxy implements Scheduler {
          +
          + /** The actor to which we report the callbacks */
          + private ActorRef mesosActor;
          — End diff –

          The `mesosActor` is actually the `MesosFlinkResourceManager`, right?

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75283918

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/store/MesosWorkerStore.java —
          @@ -0,0 +1,152 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.runtime.clusterframework.store;
          +
          +import org.apache.mesos.Protos;
          +import scala.Option;
          +
          +import java.io.Serializable;
          +import java.text.DecimalFormat;
          +import java.util.List;
          +import java.util.Objects;
          +
          +import static java.util.Objects.requireNonNull;
          +
          +/**
          + * A store of Mesos workers and associated framework information.
          + *
          + * Generates a framework ID as necessary.
          + */
          +public interface MesosWorkerStore {
          +
          + static final DecimalFormat TASKID_FORMAT = new DecimalFormat("taskmanager-00000");
          +
          + void start() throws Exception;
          +
          + void stop() throws Exception;
          +
          + Option<Protos.FrameworkID> getFrameworkID() throws Exception;
          +
          + void setFrameworkID(Option<Protos.FrameworkID> frameworkID) throws Exception;
          +
          + List<Worker> recoverWorkers() throws Exception;
          +
          + Protos.TaskID newTaskID() throws Exception;
          +
          + void putWorker(Worker worker) throws Exception;
          +
          + void removeWorker(Protos.TaskID taskID) throws Exception;
          +
          + void cleanup() throws Exception;
          +
          + /**
          + * A stored task.
          + *
          + * The assigned slaveid/hostname is valid in Launched and Released states. The hostname is needed
          + * by Fenzo for optimization purposes.
          + */
          + class Worker implements Serializable {
          + private Protos.TaskID taskID;
          +
          + private Option<Protos.SlaveID> slaveID;
          +
          + private Option<String> hostname;
          +
          + private TaskState state;
          +
          + public Worker(Protos.TaskID taskID, Option<Protos.SlaveID> slaveID, Option<String> hostname, TaskState state) {
          + requireNonNull(taskID, "taskID");
          + requireNonNull(slaveID, "slaveID");
          + requireNonNull(hostname, "hostname");
          + requireNonNull(state, "state");
          +
          + this.taskID = taskID;
          + this.slaveID = slaveID;
          + this.hostname = hostname;
          + this.state = state;
          + }
          +
          + public Protos.TaskID taskID() {
          + return taskID;
          + }
          +
          + public Option<Protos.SlaveID> slaveID() {
          + return slaveID;
          + }
          +
          + public Option<String> hostname() {
          + return hostname;
          + }
          +
          + public TaskState state() {
          + return state;
          + }
          +
          + // valid transition methods
          +
          + public static Worker newTask(Protos.TaskID taskID) {
          + return new Worker(
          + taskID,
          + Option.<Protos.SlaveID>empty(), Option.<String>empty(),
          + TaskState.New);
          + }
          +
          + public Worker launchTask(Protos.SlaveID slaveID, String hostname) {
          + return new Worker(taskID, Option.apply(slaveID), Option.apply(hostname), TaskState.Launched);
          + }
          +
          + public Worker releaseTask() {
          + return new Worker(taskID, slaveID, hostname, TaskState.Released);
          + }
          +
          + @Override
          + public boolean equals(Object o) {
          + if (this == o) {
          + return true;
          + }
          + if (o == null || getClass() != o.getClass()) {
          + return false;
          + }
          + Worker worker = (Worker) o;
          + return Objects.equals(taskID, worker.taskID) &&
          + Objects.equals(slaveID.isDefined() ? slaveID.get() : null, worker.slaveID.isDefined() ? worker.slaveID.get() : null) &&
          + Objects.equals(hostname.isDefined() ? hostname.get() : null, worker.hostname.isDefined() ? worker.hostname.get() : null) &&
          + state == worker.state;
          + }
          +
          + @Override
          + public int hashCode() {
          + return Objects.hash(taskID, slaveID.isDefined() ? slaveID.get() : null, hostname.isDefined() ? hostname.get() : null, state);
          + }
          +
          + @Override
          + public String toString() {
          + return "Worker{" +
          + "taskID=" + taskID +
          + ", slaveID=" + slaveID +
          + ", hostname=" + hostname +
          + ", state=" + state +
          + '}';
          + }
          + }
          +
          + enum TaskState {
          + New,Launched,Released
          — End diff –

          How about reformatting and adding comments about the states here?

          ```java
          New, //
          Launched, //
          Released //
          ```
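
One possible filled-in version of the skeleton above. The comment wording is an assumption, inferred from the `Worker` javadoc and the `newTask`, `launchTask` and `releaseTask` methods in the diff; it is not the author's documentation.

```java
/** Sketch of the commented enum; comment text is inferred from the diff, not taken from the PR. */
enum TaskState {
    /** State produced by newTask: a task ID exists, no slave ID or hostname is assigned. */
    New,
    /** State produced by launchTask: slave ID and hostname are set. */
    Launched,
    /** State produced by releaseTask: slave ID and hostname are carried over unchanged. */
    Released
}
```

Putting each constant on its own line also keeps later additions to a one-line diff.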

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75283689

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/store/MesosWorkerStore.java —
          @@ -0,0 +1,152 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.runtime.clusterframework.store;
          +
          +import org.apache.mesos.Protos;
          +import scala.Option;
          +
          +import java.io.Serializable;
          +import java.text.DecimalFormat;
          +import java.util.List;
          +import java.util.Objects;
          +
          +import static java.util.Objects.requireNonNull;
          +
          +/**
          + * A store of Mesos workers and associated framework information.
          + *
          + * Generates a framework ID as necessary.
          + */
          +public interface MesosWorkerStore {
          +
          + static final DecimalFormat TASKID_FORMAT = new DecimalFormat("taskmanager-00000");
          +
          + void start() throws Exception;
          +
          + void stop() throws Exception;
          +
          + Option<Protos.FrameworkID> getFrameworkID() throws Exception;
          +
          + void setFrameworkID(Option<Protos.FrameworkID> frameworkID) throws Exception;
          +
          + List<Worker> recoverWorkers() throws Exception;
          +
          + Protos.TaskID newTaskID() throws Exception;
          +
          + void putWorker(Worker worker) throws Exception;
          +
          + void removeWorker(Protos.TaskID taskID) throws Exception;
          +
          + void cleanup() throws Exception;
          +
          + /**
          + * A stored task.
          + *
          + * The assigned slaveid/hostname is valid in Launched and Released states. The hostname is needed
          + * by Fenzo for optimization purposes.
          + */
          + class Worker implements Serializable {
          + private Protos.TaskID taskID;
          +
          + private Option<Protos.SlaveID> slaveID;
          +
          + private Option<String> hostname;
          +
          + private TaskState state;
          +
          + public Worker(Protos.TaskID taskID, Option<Protos.SlaveID> slaveID, Option<String> hostname, TaskState state) {
          + requireNonNull(taskID, "taskID");
          + requireNonNull(slaveID, "slaveID");
          + requireNonNull(hostname, "hostname");
          + requireNonNull(state, "state");
          +
          + this.taskID = taskID;
          + this.slaveID = slaveID;
          + this.hostname = hostname;
          + this.state = state;
          + }
          +
          + public Protos.TaskID taskID() {
          + return taskID;
          + }
          +
          + public Option<Protos.SlaveID> slaveID() {
          + return slaveID;
          + }
          +
          + public Option<String> hostname() {
          + return hostname;
          + }
          +
          + public TaskState state() {
          + return state;
          + }
          +
          + // valid transition methods
          — End diff –

          Could you frame the transition methods with comments?
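
One reading of this request, sketched under stated assumptions: banner comments open and close the transition section, and each method gets a one-line javadoc. `WorkerSketch` and `WorkerStateSketch` are hypothetical stand-ins for `Worker` and `TaskState`; plain strings and `java.util.Optional` replace `Protos` IDs and `scala.Option` so the example compiles on its own.

```java
import java.util.Optional;

enum WorkerStateSketch { New, Launched, Released }

/** Hypothetical, simplified stand-in for MesosWorkerStore.Worker. */
final class WorkerSketch {
    private final String taskID;
    private final Optional<String> slaveID;
    private final Optional<String> hostname;
    private final WorkerStateSketch state;

    private WorkerSketch(String taskID, Optional<String> slaveID,
                         Optional<String> hostname, WorkerStateSketch state) {
        this.taskID = taskID;
        this.slaveID = slaveID;
        this.hostname = hostname;
        this.state = state;
    }

    Optional<String> hostname() { return hostname; }

    WorkerStateSketch state() { return state; }

    // ------------------------------------------------------------------------
    //  Transition methods (each returns a new instance; this one is unchanged)
    // ------------------------------------------------------------------------

    /** Creates a worker in the New state with no slave or hostname assigned. */
    static WorkerSketch newTask(String taskID) {
        return new WorkerSketch(taskID, Optional.empty(), Optional.empty(), WorkerStateSketch.New);
    }

    /** Returns a copy in the Launched state, bound to the given slave and host. */
    WorkerSketch launchTask(String slaveID, String hostname) {
        return new WorkerSketch(taskID, Optional.ofNullable(slaveID),
                Optional.ofNullable(hostname), WorkerStateSketch.Launched);
    }

    /** Returns a copy in the Released state, keeping the slave and host assignment. */
    WorkerSketch releaseTask() {
        return new WorkerSketch(taskID, slaveID, hostname, WorkerStateSketch.Released);
    }

    // ------------------------------------------------------------------------
    //  End of transition methods
    // ------------------------------------------------------------------------
}
```

The closing banner is optional; its only purpose is to separate the transitions from `equals`, `hashCode` and `toString`, which follow in the real class.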

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75283539

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/store/ZooKeeperMesosWorkerStore.java —
          @@ -0,0 +1,290 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.runtime.clusterframework.store;
          +
          +import org.apache.curator.framework.CuratorFramework;
          +import org.apache.curator.framework.recipes.shared.SharedCount;
          +import org.apache.curator.framework.recipes.shared.SharedValue;
          +import org.apache.curator.framework.recipes.shared.VersionedValue;
          +import org.apache.flink.api.java.tuple.Tuple2;
          +import org.apache.flink.configuration.ConfigConstants;
          +import org.apache.flink.configuration.Configuration;
          +import org.apache.flink.runtime.state.StateHandle;
          +import org.apache.flink.runtime.util.ZooKeeperUtils;
          +import org.apache.flink.runtime.zookeeper.StateStorageHelper;
          +import org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore;
          +import org.apache.mesos.Protos;
          +import org.apache.zookeeper.KeeperException;
          +import org.slf4j.Logger;
          +import org.slf4j.LoggerFactory;
          +import scala.Option;
          +
          +import java.util.ArrayList;
          +import java.util.Collections;
          +import java.util.ConcurrentModificationException;
          +import java.util.List;
          +
          +import static org.apache.flink.util.Preconditions.checkNotNull;
          +import static org.apache.flink.util.Preconditions.checkState;
          +
          +/**
          + * A ZooKeeper-backed Mesos worker store.
          + */
          +public class ZooKeeperMesosWorkerStore implements MesosWorkerStore {
          +
          + private static final Logger LOG = LoggerFactory.getLogger(ZooKeeperMesosWorkerStore.class);
          +
          + private final Object cacheLock = new Object();
          — End diff –

          Seems like the store should only be accessed by the ResourceManager. In this case we could remove the lock.
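
          The reasoning can be illustrated with a small sketch: if every access to a store is funnelled through a single thread, the store needs no lock of its own. Here a single-threaded executor stands in for the ResourceManager actor's mailbox; `UnlockedStore`, `runConfined` and the executor itself are hypothetical names introduced for this example and are not part of the PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ConfinedStoreDemo {

    /** Hypothetical store without internal locking; only safe when a single thread uses it. */
    static final class UnlockedStore {
        private final Map<String, String> workers = new LinkedHashMap<>();

        void put(String taskId, String state) {
            workers.put(taskId, state);
        }

        int size() {
            return workers.size();
        }
    }

    /** Performs all store operations on one thread, the way an actor processes its mailbox. */
    static int runConfined(int count) {
        UnlockedStore store = new UnlockedStore();
        ExecutorService mailbox = Executors.newSingleThreadExecutor();
        try {
            for (int i = 0; i < count; i++) {
                final int n = i;
                mailbox.execute(() -> store.put("task-" + n, "New"));
            }
            // The query runs on the same thread, after all queued writes.
            Callable<Integer> sizeQuery = store::size;
            return mailbox.submit(sizeQuery).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            mailbox.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runConfined(100));
    }
}
```

          The lock only becomes necessary again if some other component starts calling the store directly, bypassing the owning thread.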

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75282731

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/store/StandaloneMesosWorkerStore.java —
          @@ -0,0 +1,87 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.runtime.clusterframework.store;
          +
          +import com.google.common.collect.ImmutableList;
          +import org.apache.mesos.Protos;
          +import scala.Option;
          +
          +import java.util.LinkedHashMap;
          +import java.util.List;
          +import java.util.Map;
          +
          +/**
          + * A standalone Mesos worker store.
          + */
          +public class StandaloneMesosWorkerStore implements MesosWorkerStore {
          +
          + private Option<Protos.FrameworkID> frameworkID = Option.empty();
          +
          + private int taskCount = 0;
          +
          + private Map<Protos.TaskID, Worker> storedWorkers = new LinkedHashMap<>();
          +
          + public StandaloneMesosWorkerStore() {
          + }
          +
          + @Override
          + public void start() throws Exception {
          +
          + }
          +
          + @Override
          + public void stop() throws Exception {
          +
          + }
          +
          + @Override
          + public Option<Protos.FrameworkID> getFrameworkID() throws Exception {
          + return frameworkID;
          + }
          +
          + @Override
          + public void setFrameworkID(Option<Protos.FrameworkID> frameworkID) throws Exception {
          + this.frameworkID = frameworkID;
          + }
          +
          + @Override
          + public List<Worker> recoverWorkers() throws Exception {
          + return ImmutableList.copyOf(storedWorkers.values());
          + }
          +
          + @Override
          + public Protos.TaskID newTaskID() throws Exception {
          + Protos.TaskID taskID = Protos.TaskID.newBuilder().setValue(TASKID_FORMAT.format(++taskCount)).build();
          + return taskID;
          — End diff –

          Could be simplified: `return Protos.TaskID.newBuilder().setValue(TASKID_FORMAT.format(++taskCount)).build();`.

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75282003

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/store/MesosWorkerStore.java —
          @@ -0,0 +1,152 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.runtime.clusterframework.store;
          +
          +import org.apache.mesos.Protos;
          +import scala.Option;
          +
          +import java.io.Serializable;
          +import java.text.DecimalFormat;
          +import java.util.List;
          +import java.util.Objects;
          +
          +import static java.util.Objects.requireNonNull;
          +
          +/**
          + * A store of Mesos workers and associated framework information.
          + *
          + * Generates a framework ID as necessary.
          + */
          +public interface MesosWorkerStore {
          +
          + static final DecimalFormat TASKID_FORMAT = new DecimalFormat("taskmanager-00000");
          — End diff –

          By definition, fields declared in an interface are implicitly `public static final`, so the explicit `static final` modifiers are redundant here.
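
          A minimal sketch that checks this with reflection. `WorkerStoreSketch` is a hypothetical stand-in for `MesosWorkerStore` that declares the constant without any modifiers; the check confirms the compiler still treats it as `public static final`.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.text.DecimalFormat;

public class InterfaceFieldDemo {

    /** Hypothetical stand-in for MesosWorkerStore, with the redundant modifiers dropped. */
    interface WorkerStoreSketch {
        DecimalFormat TASKID_FORMAT = new DecimalFormat("taskmanager-00000");
    }

    /** Returns true if the unmodified interface field is public, static and final. */
    static boolean isImplicitConstant() {
        try {
            Field field = WorkerStoreSketch.class.getField("TASKID_FORMAT");
            int modifiers = field.getModifiers();
            return Modifier.isPublic(modifiers)
                && Modifier.isStatic(modifiers)
                && Modifier.isFinal(modifiers);
        } catch (NoSuchFieldException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("implicitly public static final: " + isImplicitConstant());
        // The constant is usable exactly as before.
        System.out.println(WorkerStoreSketch.TASKID_FORMAT.format(7));
    }
}
```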

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75281657

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/RegisteredMesosWorkerNode.scala —
          @@ -0,0 +1,33 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.runtime.clusterframework
          +
          +import org.apache.flink.mesos.runtime.clusterframework.store.MesosWorkerStore
          +import org.apache.flink.runtime.clusterframework.types.{ResourceID, ResourceIDRetrievable}
          +
          +/**
          + * A representation of a registered Mesos task managed by the {@link MesosFlinkResourceManager}.
          + */
          +case class RegisteredMesosWorkerNode(task: MesosWorkerStore.Worker) extends ResourceIDRetrievable {
          — End diff –

          Why is this class written in Scala? It seems like this class is only used from Java code.
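
          For comparison, a rough sketch of what a plain-Java equivalent might look like. A Scala case class provides `equals`, `hashCode` and `toString` for free, so the Java version has to spell them out. The types `ResourceIdRetrievableSketch` and `WorkerSketch` are simplified, hypothetical stand-ins for Flink's `ResourceIDRetrievable` and `MesosWorkerStore.Worker`, and deriving the resource ID from the task ID is an assumption, since the body of the case class is not shown in this excerpt.

```java
import java.util.Objects;

public class RegisteredWorkerNodeSketch {

    /** Hypothetical stand-in for ResourceIDRetrievable, simplified to return a String. */
    interface ResourceIdRetrievableSketch {
        String getResourceID();
    }

    /** Hypothetical stand-in for MesosWorkerStore.Worker, reduced to its task ID. */
    static final class WorkerSketch {
        private final String taskID;

        WorkerSketch(String taskID) {
            this.taskID = Objects.requireNonNull(taskID, "taskID");
        }

        String taskID() { return taskID; }

        @Override
        public boolean equals(Object o) {
            return o instanceof WorkerSketch && taskID.equals(((WorkerSketch) o).taskID);
        }

        @Override
        public int hashCode() { return taskID.hashCode(); }
    }

    /** Java counterpart of the case class: one immutable field plus value semantics. */
    static final class RegisteredMesosWorkerNode implements ResourceIdRetrievableSketch {
        private final WorkerSketch task;

        RegisteredMesosWorkerNode(WorkerSketch task) {
            this.task = Objects.requireNonNull(task, "task");
        }

        WorkerSketch task() { return task; }

        @Override
        public String getResourceID() {
            // Assumption: the resource ID is derived from the Mesos task ID.
            return task.taskID();
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof RegisteredMesosWorkerNode
                && task.equals(((RegisteredMesosWorkerNode) o).task);
        }

        @Override
        public int hashCode() { return Objects.hash(task); }

        @Override
        public String toString() {
            return "RegisteredMesosWorkerNode(" + task.taskID() + ")";
        }
    }

    public static void main(String[] args) {
        System.out.println(new RegisteredMesosWorkerNode(new WorkerSketch("task-1")));
    }
}
```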

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75280315

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/MesosFlinkResourceManager.java —
          @@ -0,0 +1,755 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.runtime.clusterframework;
          +
          +import akka.actor.ActorRef;
          +import akka.actor.Props;
          +import com.netflix.fenzo.TaskRequest;
          +import com.netflix.fenzo.TaskScheduler;
          +import com.netflix.fenzo.VirtualMachineLease;
          +import com.netflix.fenzo.functions.Action1;
          +import org.apache.flink.api.java.tuple.Tuple2;
          +import org.apache.flink.configuration.ConfigConstants;
          +import org.apache.flink.configuration.Configuration;
          +import org.apache.flink.mesos.runtime.clusterframework.store.MesosWorkerStore;
          +import org.apache.flink.mesos.scheduler.ConnectionMonitor;
          +import org.apache.flink.mesos.scheduler.LaunchableTask;
          +import org.apache.flink.mesos.scheduler.LaunchCoordinator;
          +import org.apache.flink.mesos.scheduler.ReconciliationCoordinator;
          +import org.apache.flink.mesos.scheduler.SchedulerProxy;
          +import org.apache.flink.mesos.scheduler.TaskMonitor;
          +import org.apache.flink.mesos.scheduler.TaskSchedulerBuilder;
          +import org.apache.flink.mesos.scheduler.Tasks;
          +import org.apache.flink.mesos.scheduler.messages.AcceptOffers;
          +import org.apache.flink.mesos.scheduler.messages.Disconnected;
          +import org.apache.flink.mesos.scheduler.messages.Error;
          +import org.apache.flink.mesos.scheduler.messages.OfferRescinded;
          +import org.apache.flink.mesos.scheduler.messages.ReRegistered;
          +import org.apache.flink.mesos.scheduler.messages.Registered;
          +import org.apache.flink.mesos.scheduler.messages.ResourceOffers;
          +import org.apache.flink.mesos.scheduler.messages.StatusUpdate;
          +import org.apache.flink.mesos.util.MesosConfiguration;
          +import org.apache.flink.runtime.clusterframework.ApplicationStatus;
          +import org.apache.flink.runtime.clusterframework.FlinkResourceManager;
          +import org.apache.flink.runtime.clusterframework.messages.FatalErrorOccurred;
          +import org.apache.flink.runtime.clusterframework.messages.StopCluster;
          +import org.apache.flink.runtime.clusterframework.types.ResourceID;
          +import org.apache.flink.runtime.leaderretrieval.LeaderRetrievalService;
          +import org.apache.mesos.Protos;
          +import org.apache.mesos.Protos.FrameworkInfo;
          +import org.apache.mesos.SchedulerDriver;
          +import org.slf4j.Logger;
          +import scala.Option;
          +
          +import java.util.ArrayList;
          +import java.util.Collection;
          +import java.util.HashMap;
          +import java.util.List;
          +import java.util.Map;
          +
          +import static java.util.Objects.requireNonNull;
          +
          +/**
          + * Flink Resource Manager for Apache Mesos.
          + */
          +public class MesosFlinkResourceManager extends FlinkResourceManager<RegisteredMesosWorkerNode> {
          +
          + /** The Mesos configuration (master and framework info) */
          + private final MesosConfiguration mesosConfig;
          +
          + /** The TaskManager container parameters (like container memory size) */
          + private final MesosTaskManagerParameters taskManagerParameters;
          +
          + /** Context information used to start a TaskManager Java process */
          + private final Protos.TaskInfo.Builder taskManagerLaunchContext;
          +
          + /** Number of failed Mesos tasks before stopping the application. -1 means infinite. */
          + private final int maxFailedTasks;
          +
          + /** Callback handler for the asynchronous Mesos scheduler */
          + private SchedulerProxy schedulerCallbackHandler;
          +
          + /** Mesos scheduler driver */
          + private SchedulerDriver schedulerDriver;
          +
          + private ActorRef connectionMonitor;
          +
          + private ActorRef taskRouter;
          +
          + private ActorRef launchCoordinator;
          +
          + private ActorRef reconciliationCoordinator;
          +
          + private MesosWorkerStore workerStore;
          +
          + final Map<ResourceID, MesosWorkerStore.Worker> workersInNew;
          + final Map<ResourceID, MesosWorkerStore.Worker> workersInLaunch;
          + final Map<ResourceID, MesosWorkerStore.Worker> workersBeingReturned;
          +
          + /** The number of failed tasks since the master became active */
          + private int failedTasksSoFar;
          +
          + public MesosFlinkResourceManager(
          + Configuration flinkConfig,
          + MesosConfiguration mesosConfig,
          + MesosWorkerStore workerStore,
          + LeaderRetrievalService leaderRetrievalService,
          + MesosTaskManagerParameters taskManagerParameters,
          + Protos.TaskInfo.Builder taskManagerLaunchContext,
          + int maxFailedTasks,
          + int numInitialTaskManagers) {
          +
          + super(numInitialTaskManagers, flinkConfig, leaderRetrievalService);
          +
          + this.mesosConfig = requireNonNull(mesosConfig);
          +
          + this.workerStore = requireNonNull(workerStore);
          +
          + this.taskManagerParameters = requireNonNull(taskManagerParameters);
          + this.taskManagerLaunchContext = requireNonNull(taskManagerLaunchContext);
          + this.maxFailedTasks = maxFailedTasks;
          +
          + this.workersInNew = new HashMap<>();
          + this.workersInLaunch = new HashMap<>();
          + this.workersBeingReturned = new HashMap<>();
          + }
          +
          + // ------------------------------------------------------------------------
          + // Mesos-specific behavior
          + // ------------------------------------------------------------------------
          +
          + @Override
          + protected void initialize() throws Exception {
          + LOG.info("Initializing Mesos resource master");
          +
          + workerStore.start();
          +
          + // create the scheduler driver to communicate with Mesos
          + schedulerCallbackHandler = new SchedulerProxy(self());
          +
          + // register with Mesos
          + FrameworkInfo.Builder frameworkInfo = mesosConfig.frameworkInfo()
          + .clone()
          + .setCheckpoint(true);
          +
          + Option<Protos.FrameworkID> frameworkID = workerStore.getFrameworkID();
          + if(frameworkID.isEmpty()) {
          + LOG.info("Registering as new framework.");
          + }
          + else {
          + LOG.info("Recovery scenario: re-registering using framework ID {}.", frameworkID.get().getValue());
          + frameworkInfo.setId(frameworkID.get());
          + }
          +
          + MesosConfiguration initializedMesosConfig = mesosConfig.withFrameworkInfo(frameworkInfo);
          + MesosConfiguration.logMesosConfig(LOG, initializedMesosConfig);
          + schedulerDriver = initializedMesosConfig.createDriver(schedulerCallbackHandler, false);
          +
          + // create supporting actors
          + connectionMonitor = createConnectionMonitor();
          + launchCoordinator = createLaunchCoordinator();
          + reconciliationCoordinator = createReconciliationCoordinator();
          + taskRouter = createTaskRouter();
          +
          + recoverWorkers();
          +
          + connectionMonitor.tell(new ConnectionMonitor.Start(), self());
          + schedulerDriver.start();
          + }
          +
          + protected ActorRef createConnectionMonitor() {
          + return context().actorOf(
          + ConnectionMonitor.createActorProps(ConnectionMonitor.class, config),
          + "connectionMonitor");
          + }
          +
          + protected ActorRef createTaskRouter() {
          + return context().actorOf(
          + Tasks.createActorProps(Tasks.class, config, schedulerDriver, TaskMonitor.class),
          + "tasks");
          + }
          +
          + protected ActorRef createLaunchCoordinator() {
          + return context().actorOf(
          + LaunchCoordinator.createActorProps(LaunchCoordinator.class, self(), config, schedulerDriver, createOptimizer()),
          + "launchCoordinator");
          + }
          +
          + protected ActorRef createReconciliationCoordinator() {
          + return context().actorOf(
          + ReconciliationCoordinator.createActorProps(ReconciliationCoordinator.class, config, schedulerDriver),
          + "reconciliationCoordinator");
          + }
          +
          + @Override
          + public void postStop() {
          + LOG.info("Stopping Mesos resource master");
          + super.postStop();
          + }
          +
          + // ------------------------------------------------------------------------
          + // Actor messages
          + // ------------------------------------------------------------------------
          +
          + @Override
          + protected void handleMessage(Object message) {
          +
          + // check for Mesos-specific actor messages first
          +
          + // --- messages about Mesos connection
          + if (message instanceof Registered) {
          + registered((Registered) message);
          + } else if (message instanceof ReRegistered) {
          + reregistered((ReRegistered) message);
          + } else if (message instanceof Disconnected) {
          + disconnected((Disconnected) message);
          + } else if (message instanceof Error) {
          + error(((Error) message).message());
          +
          + // --- messages about offers
          + } else if (message instanceof ResourceOffers || message instanceof OfferRescinded) {
          + launchCoordinator.tell(message, self());
          + } else if (message instanceof AcceptOffers) {
          + acceptOffers((AcceptOffers) message);
          +
          + // --- messages about tasks
          + } else if (message instanceof StatusUpdate) {
          + taskStatusUpdated((StatusUpdate) message);
          + } else if (message instanceof ReconciliationCoordinator.Reconcile) {
          + // a reconciliation request from a task
          + reconciliationCoordinator.tell(message, self());
          + } else if (message instanceof TaskMonitor.TaskTerminated) {
          + // a termination message from a task
          + TaskMonitor.TaskTerminated msg = (TaskMonitor.TaskTerminated) message;
          + taskTerminated(msg.taskID(), msg.status());
          +
          + } else {
          + // message handled by the generic resource master code
          + super.handleMessage(message);
          + }
          + }
          +
          + /**
          + * Called to shut down the cluster (not a failover situation).
          + *
          + * @param finalStatus The application status to report.
          + * @param optionalDiagnostics An optional diagnostics message.
          + */
          + @Override
          + protected void shutdownApplication(ApplicationStatus finalStatus, String optionalDiagnostics) {
          +
          + LOG.info("Shutting down and unregistering as a Mesos framework.");
          + try {
          + // unregister the framework, which implicitly removes all tasks.
          + schedulerDriver.stop(false);
          + }
          + catch(Exception ex) {
          + LOG.warn("unable to unregister the framework", ex);
          + }
          +
          + try {
          + workerStore.cleanup();
          + }
          + catch(Exception ex) {
          + LOG.warn("unable to cleanup the ZooKeeper state", ex);
          + }
          +
          + context().stop(self());
          + }
          +
          + @Override
          + protected void fatalError(String message, Throwable error) {
          + // we do not unregister, but cause a hard fail of this process, to have it
          + // restarted by the dispatcher
          + LOG.error("FATAL ERROR IN MESOS APPLICATION MASTER: " + message, error);
          + LOG.error("Shutting down process");
          +
          + // kill this process, this will make an external supervisor (the dispatcher) restart the process
          + System.exit(EXIT_CODE_FATAL_ERROR);
          + }
          +
          + // ------------------------------------------------------------------------
          + // Worker Management
          + // ------------------------------------------------------------------------
          +
          + /**
          + * Recover framework/worker information persisted by a prior incarnation of the RM.
          + */
          + private void recoverWorkers() throws Exception {
          + // if this application master starts as part of an ApplicationMaster/JobManager recovery,
          + // then some worker tasks are most likely still alive and we can re-obtain them
          + final List<MesosWorkerStore.Worker> tasksFromPreviousAttempts = workerStore.recoverWorkers();
          +
          + if (!tasksFromPreviousAttempts.isEmpty()) {
          + LOG.info("Retrieved {} TaskManagers from previous attempt", tasksFromPreviousAttempts.size());
          +
          + List<Tuple2<TaskRequest,String>> toAssign = new ArrayList<>(tasksFromPreviousAttempts.size());
          + List<LaunchableTask> toLaunch = new ArrayList<>(tasksFromPreviousAttempts.size());
          +
          + for (final MesosWorkerStore.Worker worker : tasksFromPreviousAttempts) {
          + LaunchableMesosWorker launchable = createLaunchableMesosWorker(worker.taskID());
          +
          + switch(worker.state()) {
          + case New:
          + workersInNew.put(extractResourceID(worker.taskID()), worker);
          + toLaunch.add(launchable);
          + break;
          + case Launched:
          + workersInLaunch.put(extractResourceID(worker.taskID()), worker);
          + toAssign.add(new Tuple2<>(launchable.taskRequest(), worker.hostname().get()));
          + break;
          + case Released:
          + workersBeingReturned.put(extractResourceID(worker.taskID()), worker);
          + break;
          + }
          + taskRouter.tell(new TaskMonitor.TaskGoalStateUpdated(extractGoalState(worker)), self());
          + }
          +
          + // tell the launch coordinator about prior assignments
          + if(toAssign.size() >= 1) {
          + launchCoordinator.tell(new LaunchCoordinator.Assign(toAssign), self());
          + }
          + // tell the launch coordinator to launch any new tasks
          + if(toLaunch.size() >= 1) {
          + launchCoordinator.tell(new LaunchCoordinator.Launch(toLaunch), self());
          + }
          + }
          + }
          +
          + /**
          + * Plan for some additional workers to be launched.
          + *
          + * @param numWorkers The number of workers to allocate.
          + */
          + @Override
          + protected void requestNewWorkers(int numWorkers) {
          +
          + try {
          + List<TaskMonitor.TaskGoalStateUpdated> toMonitor = new ArrayList<>(numWorkers);
          + List<LaunchableTask> toLaunch = new ArrayList<>(numWorkers);
          +
          + // generate new workers into persistent state and launch associated actors
          + for (int i = 0; i < numWorkers; i++) {
          + MesosWorkerStore.Worker worker = MesosWorkerStore.Worker.newTask(workerStore.newTaskID());
          + workerStore.putWorker(worker);
          + workersInNew.put(extractResourceID(worker.taskID()), worker);
          +
          + LaunchableMesosWorker launchable = createLaunchableMesosWorker(worker.taskID());
          +
          + LOG.info("Scheduling Mesos task {} with ({} MB, {} cpus).",
          + launchable.taskID().getValue(), launchable.taskRequest().getMemory(), launchable.taskRequest().getCPUs());
          +
          + toMonitor.add(new TaskMonitor.TaskGoalStateUpdated(extractGoalState(worker)));
          + toLaunch.add(launchable);
          + }
          +
          + // tell the task router about the new plans
          + for (TaskMonitor.TaskGoalStateUpdated update : toMonitor) {
          + taskRouter.tell(update, self());
          + }
          +
          + // tell the launch coordinator to launch the new tasks
          + if(toLaunch.size() >= 1) {
          + launchCoordinator.tell(new LaunchCoordinator.Launch(toLaunch), self());
          + }
          + }
          + catch(Exception ex) {
          + fatalError("unable to request new workers", ex);
          + }
          + }
          +
          + /**
          + * Accept offers as advised by the launch coordinator.
          + *
          + * Acceptance is routed through the RM to update the persistent state before
          + * forwarding the message to Mesos.
          + */
          + private void acceptOffers(AcceptOffers msg) {
          +
          + try {
          + List<TaskMonitor.TaskGoalStateUpdated> toMonitor = new ArrayList<>(msg.operations().size());
          +
          + // transition the persistent state of some tasks to Launched
          + for (Protos.Offer.Operation op : msg.operations()) {
          + if (op.getType() != Protos.Offer.Operation.Type.LAUNCH) {
          + continue;
          + }
          + for (Protos.TaskInfo info : op.getLaunch().getTaskInfosList()) {
          + MesosWorkerStore.Worker worker = workersInNew.remove(extractResourceID(info.getTaskId()));
          + assert (worker != null);
          +
          + worker = worker.launchTask(info.getSlaveId(), msg.hostname());
          + workerStore.putWorker(worker);
          + workersInLaunch.put(extractResourceID(worker.taskID()), worker);
          +
          + LOG.info("Launching Mesos task {} on host {}.",
          + worker.taskID().getValue(), worker.hostname().get());
          +
          + toMonitor.add(new TaskMonitor.TaskGoalStateUpdated(extractGoalState(worker)));
          + }
          + }
          +
          + // tell the task router about the new plans
          + for (TaskMonitor.TaskGoalStateUpdated update : toMonitor) {
          + taskRouter.tell(update, self());
          + }
          +
          + // send the acceptance message to Mesos
          + schedulerDriver.acceptOffers(msg.offerIds(), msg.operations(), msg.filters());
          + }
          + catch(Exception ex) {
          + fatalError("unable to accept offers", ex);
          + }
          + }
          +
          + /**
          + * Handle a task status change.
          + */
          + private void taskStatusUpdated(StatusUpdate message) {
          + taskRouter.tell(message, self());
          + reconciliationCoordinator.tell(message, self());
          + schedulerDriver.acknowledgeStatusUpdate(message.status());
          + }
          +
          + /**
          + * Accept the given started worker into the internal state.
          + *
          + * @param resourceID The worker resource id
          + * @return A registered worker node record.
          + */
          + @Override
          + protected RegisteredMesosWorkerNode workerStarted(ResourceID resourceID) {
          + MesosWorkerStore.Worker inLaunch = workersInLaunch.remove(resourceID);
          + if (inLaunch == null) {
          + // Worker was not in state "being launched", this can indicate that the TaskManager
          + // in this worker was already registered or that the container was not started
          + // by this resource manager. Simply ignore this resourceID.
          + return null;
          + }
          + return new RegisteredMesosWorkerNode(inLaunch);
          + }
          +
          + /**
          + * Accept the given registered workers into the internal state.
          + *
          + * @param toConsolidate The worker IDs known previously to the JobManager.
          + * @return A collection of registered worker node records.
          + */
          + @Override
          + protected Collection<RegisteredMesosWorkerNode> reacceptRegisteredWorkers(Collection<ResourceID> toConsolidate) {
          +
          + // we check for each task manager if we recognize its Mesos task ID
          + List<RegisteredMesosWorkerNode> accepted = new ArrayList<>(toConsolidate.size());
          + for (ResourceID resourceID : toConsolidate) {
          + MesosWorkerStore.Worker worker = workersInLaunch.remove(resourceID);
          + if (worker != null) {
          + LOG.info("Mesos worker consolidation recognizes TaskManager {}.", resourceID);
          + accepted.add(new RegisteredMesosWorkerNode(worker));
          + }
          + else {
          + if(isStarted(resourceID)) {
          + LOG.info("TaskManager {} has already been registered at the resource manager.", resourceID);
          + }
          + else {
          + LOG.info("Mesos worker consolidation does not recognize TaskManager {}.", resourceID);
          + }
          + }
          + }
          + return accepted;
          + }
          +
          + /**
          + * Release the given pending worker.
          + */
          + @Override
          + protected void releasePendingWorker(ResourceID id) {
          + MesosWorkerStore.Worker worker = workersInLaunch.remove(id);
          + if (worker != null) {
          + releaseWorker(worker);
          + } else {
          + LOG.error("Cannot find worker {} to release. Ignoring request.", id);
          + }
          + }
          +
          + /**
          + * Release the given started worker.
          + */
          + @Override
          + protected void releaseStartedWorker(RegisteredMesosWorkerNode worker) {
          + releaseWorker(worker.task());
          + }
          +
          + /**
          + * Plan for the removal of the given worker.
          + */
          + private void releaseWorker(MesosWorkerStore.Worker worker) {
          + try {
          + LOG.info("Releasing worker {}", worker.taskID());
          +
          + // update persistent state of worker to Released
          + worker = worker.releaseTask();
          + workerStore.putWorker(worker);
          + workersBeingReturned.put(extractResourceID(worker.taskID()), worker);
          + taskRouter.tell(new TaskMonitor.TaskGoalStateUpdated(extractGoalState(worker)), self());
          +
          + if (worker.hostname().isDefined()) {
          + // tell the launch coordinator that the task is being unassigned from the host, for planning purposes
          + launchCoordinator.tell(new LaunchCoordinator.Unassign(worker.taskID(), worker.hostname().get()), self());
          + }
          + }
          + catch (Exception ex) {
          + fatalError("unable to release worker", ex);
          + }
          + }
          +
          + @Override
          + protected int getNumWorkerRequestsPending() {
          + return workersInNew.size();
          + }
          +
          + @Override
          + protected int getNumWorkersPendingRegistration() {
          + return workersInLaunch.size();
          + }
          +
          + // ------------------------------------------------------------------------
          + // Callbacks from the Mesos Master
          + // ------------------------------------------------------------------------
          +
          + /**
          + * Called when connected to Mesos as a new framework.
          + */
          + private void registered(Registered message) {
          + connectionMonitor.tell(message, self());
          +
          + try {
          + workerStore.setFrameworkID(Option.apply(message.frameworkId()));
          + }
          + catch(Exception ex) {
          + fatalError("unable to store the assigned framework ID", ex);
          + return;
          + }
          +
          + launchCoordinator.tell(message, self());
          + reconciliationCoordinator.tell(message, self());
          + taskRouter.tell(message, self());
          + }
          +
          + /**
          + * Called when reconnected to Mesos following a failover event.
          + */
          + private void reregistered(ReRegistered message) {
          + connectionMonitor.tell(message, self());
          + launchCoordinator.tell(message, self());
          + reconciliationCoordinator.tell(message, self());
          + taskRouter.tell(message, self());
          + }
          +
          + /**
          + * Called when disconnected from Mesos.
          + */
          + private void disconnected(Disconnected message) {
          + connectionMonitor.tell(message, self());
          + launchCoordinator.tell(message, self());
          + reconciliationCoordinator.tell(message, self());
          + taskRouter.tell(message, self());
          + }
          +
          + /**
          + * Called when an error is reported by the scheduler callback.
          + */
          + private void error(String message) {
          + self().tell(new FatalErrorOccurred("Connection to Mesos failed", new Exception(message)), self());
          + }
          +
          + /**
          + * Invoked when a Mesos task reaches a terminal status.
          + */
          + private void taskTerminated(Protos.TaskID taskID, Protos.TaskStatus status) {
          + // this callback occurs for failed containers and for released containers alike
          +
          + final ResourceID id = extractResourceID(taskID);
          +
          + try {
          + workerStore.removeWorker(taskID);
          + }
          + catch(Exception ex) {
          + fatalError("unable to remove worker", ex);
          + return;
          + }
          +
          + // check if this is a failed task or a released task
          + if (workersBeingReturned.remove(id) != null) {
          + // regular finished worker that we released
          + LOG.info("Worker {} finished successfully with diagnostics: {}",
          + id, status.getMessage());
          + } else {
          + // failed worker, either at startup, or running
          + final MesosWorkerStore.Worker launched = workersInLaunch.remove(id);
          + if (launched != null) {
          + LOG.info("Mesos task {} failed, with a TaskManager in launch or registration. " +
          + "State: {} Reason: {} ({})", id, status.getState(), status.getReason(), status.getMessage());
          + // we will trigger re-acquiring new workers at the end
          + } else {
          + // failed registered worker
          + LOG.info("Mesos task {} failed, with a registered TaskManager. " +
          + "State: {} Reason: {} ({})", id, status.getState(), status.getReason(), status.getMessage());
          +
          + // notify the generic logic, which notifies the JobManager, etc.
          + notifyWorkerFailed(id, "Mesos task " + id + " failed. State: " + status.getState());
          + }
          +
          + // general failure logging
          + failedTasksSoFar++;
          +
          + String diagMessage = String.format("Diagnostics for task %s in state %s : " +
          + "reason=%s message=%s",
          + id, status.getState(), status.getReason(), status.getMessage());
          + sendInfoMessage(diagMessage);
          +
          + LOG.info(diagMessage);
          + LOG.info("Total number of failed tasks so far: " + failedTasksSoFar);
          +
          + // maxFailedTasks == -1 is infinite number of retries.
          + if (maxFailedTasks >= 0 && failedTasksSoFar > maxFailedTasks) {
          + String msg = "Stopping Mesos session because the number of failed tasks ("
          + + failedTasksSoFar + ") exceeded the maximum failed tasks ("
          + + maxFailedTasks + "). This number is controlled by the '"
          + + ConfigConstants.MESOS_MAX_FAILED_TASKS + "' configuration setting. "
          + + "By default its the number of requested tasks.";
          +
          + LOG.error(msg);
          + self().tell(decorateMessage(new StopCluster(ApplicationStatus.FAILED, msg)),
          + ActorRef.noSender());
          +
          + // no need to do anything else
          + return;
          + }
          + }
          +
          + // in case failed containers were among the finished containers, make
          + // sure we re-examine and request new ones
          + triggerCheckWorkers();
          + }
          +
          + // ------------------------------------------------------------------------
          + // Utilities
          + // ------------------------------------------------------------------------
          +
          + private LaunchableMesosWorker createLaunchableMesosWorker(Protos.TaskID taskID) {
          + LaunchableMesosWorker launchable =
          + new LaunchableMesosWorker(taskManagerParameters, taskManagerLaunchContext, taskID);
          + return launchable;
          + }
          +
          + /**
          + * Extracts a unique ResourceID from the Mesos task.
          + *
          + * @param taskId the Mesos TaskID
          + * @return The ResourceID for the container
          + */
          + static ResourceID extractResourceID(Protos.TaskID taskId) {
          + return new ResourceID(taskId.getValue());
          + }
          +
          + /**
          + * Extracts the Mesos task goal state from the worker information.
          + * @param worker the persistent worker information.
          + * @return goal state information for the {@link TaskMonitor}.
          + */
          + static TaskMonitor.TaskGoalState extractGoalState(MesosWorkerStore.Worker worker) {
          + switch(worker.state()) {
          + case New: return new TaskMonitor.New(worker.taskID());
          + case Launched: return new TaskMonitor.Launched(worker.taskID(), worker.slaveID().get());
          + case Released: return new TaskMonitor.Released(worker.taskID(), worker.slaveID().get());
          + default: throw new IllegalArgumentException();
          + }
          + }
          +
          + /**
          + * Creates the Fenzo optimizer (builder).
          + * The builder is an indirection to faciliate unit testing of the Launch Coordinator.
          --- End diff --

          facilitate
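
          For readers following the review, the worker bookkeeping in the diff above can be modeled without Mesos, Akka, or the ZooKeeper-backed store. The sketch below is a simplified illustration, not code from the pull request: the class `WorkerLifecycleSketch`, its method names, and the `Outcome` enum are hypothetical, and task IDs are reduced to plain strings. It assumes only what the diff shows: workers move from New to Launched when an offer is accepted, a released worker terminates cleanly, any other termination counts as a failure, and a negative `maxFailedTasks` means unlimited retries.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical, dependency-free model of the worker bookkeeping in the diff.
 * The three sets stand in for workersInNew, workersInLaunch and
 * workersBeingReturned; persistence, Akka and Mesos calls are omitted.
 */
final class WorkerLifecycleSketch {

    /** Result of a terminal task status, mirroring the branches of taskTerminated(). */
    enum Outcome { RELEASED, FAILED_RETRY, FAILED_STOP }

    private final Set<String> inNew = new HashSet<>();
    private final Set<String> inLaunch = new HashSet<>();
    private final Set<String> beingReturned = new HashSet<>();

    /** Maximum tolerated failures; a negative value means unlimited retries. */
    private final int maxFailedTasks;
    private int failedTasksSoFar;

    WorkerLifecycleSketch(int maxFailedTasks) {
        this.maxFailedTasks = maxFailedTasks;
    }

    /** Like requestNewWorkers(): a planned worker starts out in the New state. */
    void requestNewWorker(String taskId) {
        inNew.add(taskId);
    }

    /** Like acceptOffers(): New -> Launched once an offer is accepted for the task. */
    void acceptOffer(String taskId) {
        // The diff uses an assert here; this sketch throws instead (an assumption).
        if (!inNew.remove(taskId)) {
            throw new IllegalStateException("unknown task: " + taskId);
        }
        inLaunch.add(taskId);
    }

    /** Like workerStarted(): true only if the worker was pending registration. */
    boolean workerStarted(String taskId) {
        return inLaunch.remove(taskId);
    }

    /** Like releaseWorker(): mark the worker as Released, whether pending or registered. */
    void release(String taskId) {
        inLaunch.remove(taskId);
        beingReturned.add(taskId);
    }

    /** Like taskTerminated(): a released worker ends cleanly; anything else is a failure. */
    Outcome taskTerminated(String taskId) {
        if (beingReturned.remove(taskId)) {
            return Outcome.RELEASED;
        }
        inLaunch.remove(taskId); // no-op if the worker had already registered
        failedTasksSoFar++;
        boolean limitExceeded = maxFailedTasks >= 0 && failedTasksSoFar > maxFailedTasks;
        return limitExceeded ? Outcome.FAILED_STOP : Outcome.FAILED_RETRY;
    }

    int pendingRequests() { return inNew.size(); }

    int pendingRegistration() { return inLaunch.size(); }

    public static void main(String[] args) {
        WorkerLifecycleSketch rm = new WorkerLifecycleSketch(1);
        rm.requestNewWorker("task-1");
        rm.acceptOffer("task-1");
        System.out.println(rm.taskTerminated("task-1"));
    }
}
```

          Note the strict comparison: with `maxFailedTasks` set to 1, the first failure is tolerated and only the second one stops the session, which matches the `failedTasksSoFar > maxFailedTasks` check in the diff.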

+ */ + @Override + protected RegisteredMesosWorkerNode workerStarted(ResourceID resourceID) { + MesosWorkerStore.Worker inLaunch = workersInLaunch.remove(resourceID); + if (inLaunch == null) { + // Worker was not in state "being launched", this can indicate that the TaskManager + // in this worker was already registered or that the container was not started + // by this resource manager. Simply ignore this resourceID. + return null; + } + return new RegisteredMesosWorkerNode(inLaunch); + } + + /** + * Accept the given registered workers into the internal state. + * + * @param toConsolidate The worker IDs known previously to the JobManager. + * @return A collection of registered worker node records. + */ + @Override + protected Collection<RegisteredMesosWorkerNode> reacceptRegisteredWorkers(Collection<ResourceID> toConsolidate) { + + // we check for each task manager if we recognize its Mesos task ID + List<RegisteredMesosWorkerNode> accepted = new ArrayList<>(toConsolidate.size()); + for (ResourceID resourceID : toConsolidate) { + MesosWorkerStore.Worker worker = workersInLaunch.remove(resourceID); + if (worker != null) { + LOG.info("Mesos worker consolidation recognizes TaskManager {}.", resourceID); + accepted.add(new RegisteredMesosWorkerNode(worker)); + } + else { + if(isStarted(resourceID)) { + LOG.info("TaskManager {} has already been registered at the resource manager.", resourceID); + } + else { + LOG.info("Mesos worker consolidation does not recognize TaskManager {}.", resourceID); + } + } + } + return accepted; + } + + /** + * Release the given pending worker. + */ + @Override + protected void releasePendingWorker(ResourceID id) { + MesosWorkerStore.Worker worker = workersInLaunch.remove(id); + if (worker != null) { + releaseWorker(worker); + } else { + LOG.error("Cannot find worker {} to release. Ignoring request.", id); + } + } + + /** + * Release the given started worker. 
+ */ + @Override + protected void releaseStartedWorker(RegisteredMesosWorkerNode worker) { + releaseWorker(worker.task()); + } + + /** + * Plan for the removal of the given worker. + */ + private void releaseWorker(MesosWorkerStore.Worker worker) { + try { + LOG.info("Releasing worker {}", worker.taskID()); + + // update persistent state of worker to Released + worker = worker.releaseTask(); + workerStore.putWorker(worker); + workersBeingReturned.put(extractResourceID(worker.taskID()), worker); + taskRouter.tell(new TaskMonitor.TaskGoalStateUpdated(extractGoalState(worker)), self()); + + if (worker.hostname().isDefined()) { + // tell the launch coordinator that the task is being unassigned from the host, for planning purposes + launchCoordinator.tell(new LaunchCoordinator.Unassign(worker.taskID(), worker.hostname().get()), self()); + } + } + catch (Exception ex) { + fatalError("unable to release worker", ex); + } + } + + @Override + protected int getNumWorkerRequestsPending() { + return workersInNew.size(); + } + + @Override + protected int getNumWorkersPendingRegistration() { + return workersInLaunch.size(); + } + + // ------------------------------------------------------------------------ + // Callbacks from the Mesos Master + // ------------------------------------------------------------------------ + + /** + * Called when connected to Mesos as a new framework. + */ + private void registered(Registered message) { + connectionMonitor.tell(message, self()); + + try { + workerStore.setFrameworkID(Option.apply(message.frameworkId())); + } + catch(Exception ex) { + fatalError("unable to store the assigned framework ID", ex); + return; + } + + launchCoordinator.tell(message, self()); + reconciliationCoordinator.tell(message, self()); + taskRouter.tell(message, self()); + } + + /** + * Called when reconnected to Mesos following a failover event. 
+ */ + private void reregistered(ReRegistered message) { + connectionMonitor.tell(message, self()); + launchCoordinator.tell(message, self()); + reconciliationCoordinator.tell(message, self()); + taskRouter.tell(message, self()); + } + + /** + * Called when disconnected from Mesos. + */ + private void disconnected(Disconnected message) { + connectionMonitor.tell(message, self()); + launchCoordinator.tell(message, self()); + reconciliationCoordinator.tell(message, self()); + taskRouter.tell(message, self()); + } + + /** + * Called when an error is reported by the scheduler callback. + */ + private void error(String message) { + self().tell(new FatalErrorOccurred("Connection to Mesos failed", new Exception(message)), self()); + } + + /** + * Invoked when a Mesos task reaches a terminal status. + */ + private void taskTerminated(Protos.TaskID taskID, Protos.TaskStatus status) { + // this callback occurs for failed containers and for released containers alike + + final ResourceID id = extractResourceID(taskID); + + try { + workerStore.removeWorker(taskID); + } + catch(Exception ex) { + fatalError("unable to remove worker", ex); + return; + } + + // check if this is a failed task or a released task + if (workersBeingReturned.remove(id) != null) { + // regular finished worker that we released + LOG.info("Worker {} finished successfully with diagnostics: {}", + id, status.getMessage()); + } else { + // failed worker, either at startup, or running + final MesosWorkerStore.Worker launched = workersInLaunch.remove(id); + if (launched != null) { + LOG.info("Mesos task {} failed, with a TaskManager in launch or registration. " + + "State: {} Reason: {} ({})", id, status.getState(), status.getReason(), status.getMessage()); + // we will trigger re-acquiring new workers at the end + } else { + // failed registered worker + LOG.info("Mesos task {} failed, with a registered TaskManager. 
" + + "State: {} Reason: {} ({})", id, status.getState(), status.getReason(), status.getMessage()); + + // notify the generic logic, which notifies the JobManager, etc. + notifyWorkerFailed(id, "Mesos task " + id + " failed. State: " + status.getState()); + } + + // general failure logging + failedTasksSoFar++; + + String diagMessage = String.format("Diagnostics for task %s in state %s : " + + "reason=%s message=%s", + id, status.getState(), status.getReason(), status.getMessage()); + sendInfoMessage(diagMessage); + + LOG.info(diagMessage); + LOG.info("Total number of failed tasks so far: " + failedTasksSoFar); + + // maxFailedTasks == -1 is infinite number of retries. + if (maxFailedTasks >= 0 && failedTasksSoFar > maxFailedTasks) { + String msg = "Stopping Mesos session because the number of failed tasks (" + + failedTasksSoFar + ") exceeded the maximum failed tasks (" + + maxFailedTasks + "). This number is controlled by the '" + + ConfigConstants.MESOS_MAX_FAILED_TASKS + "' configuration setting. " + + "By default its the number of requested tasks."; + + LOG.error(msg); + self().tell(decorateMessage(new StopCluster(ApplicationStatus.FAILED, msg)), + ActorRef.noSender()); + + // no need to do anything else + return; + } + } + + // in case failed containers were among the finished containers, make + // sure we re-examine and request new ones + triggerCheckWorkers(); + } + + // ------------------------------------------------------------------------ + // Utilities + // ------------------------------------------------------------------------ + + private LaunchableMesosWorker createLaunchableMesosWorker(Protos.TaskID taskID) { + LaunchableMesosWorker launchable = + new LaunchableMesosWorker(taskManagerParameters, taskManagerLaunchContext, taskID); + return launchable; + } + + /** + * Extracts a unique ResourceID from the Mesos task. 
          + *
          + * @param taskId the Mesos TaskID
          + * @return The ResourceID for the container
          + */
          + static ResourceID extractResourceID(Protos.TaskID taskId) {
          + return new ResourceID(taskId.getValue());
          + }
          +
          + /**
          + * Extracts the Mesos task goal state from the worker information.
          + * @param worker the persistent worker information.
          + * @return goal state information for the {@Link TaskMonitor} .
          + */
          + static TaskMonitor.TaskGoalState extractGoalState(MesosWorkerStore.Worker worker) {
          + switch(worker.state()) {
          + case New: return new TaskMonitor.New(worker.taskID());
          + case Launched: return new TaskMonitor.Launched(worker.taskID(), worker.slaveID().get());
          + case Released: return new TaskMonitor.Released(worker.taskID(), worker.slaveID().get());
          + default: throw new IllegalArgumentException();
          + }
          + }
          +
          + /**
          + * Creates the Fenzo optimizer (builder).
          + * The builder is an indirection to faciliate unit testing of the Launch Coordinator.
          — End diff –

          facilitate
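
          For readers following the quoted diff, the worker bookkeeping it implements (a task moves from workersInNew to workersInLaunch when offers are accepted, leaves workersInLaunch on registration, and is treated as a failure on termination unless it was first placed in workersBeingReturned) can be sketched in isolation. This is a simplified illustration only: the class WorkerLifecycleSketch and its method names are hypothetical, plain string IDs stand in for ResourceID, and the persistence, actor messaging and Mesos calls of the real class are omitted.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, stripped-down model of the three worker collections in the diff. */
public class WorkerLifecycleSketch {

    private final Set<String> workersInNew = new HashSet<>();
    private final Set<String> workersInLaunch = new HashSet<>();
    private final Set<String> workersBeingReturned = new HashSet<>();
    private int failedTasksSoFar;

    /** Plan a new worker (compare requestNewWorkers in the diff). */
    public void requestNew(String taskId) {
        workersInNew.add(taskId);
    }

    /** An offer was accepted for the task (compare acceptOffers). */
    public void launch(String taskId) {
        if (!workersInNew.remove(taskId)) {
            throw new IllegalStateException("unknown task " + taskId);
        }
        workersInLaunch.add(taskId);
    }

    /** The worker registered; returns false if it was not pending (compare workerStarted). */
    public boolean started(String taskId) {
        return workersInLaunch.remove(taskId);
    }

    /** Plan the removal of a pending or started worker (compare releaseWorker). */
    public void release(String taskId) {
        workersInLaunch.remove(taskId);
        workersBeingReturned.add(taskId);
    }

    /** Returns true for a planned release, false for a failure (compare taskTerminated). */
    public boolean terminated(String taskId) {
        if (workersBeingReturned.remove(taskId)) {
            return true;
        }
        workersInLaunch.remove(taskId);
        failedTasksSoFar++;
        return false;
    }

    public int pendingRequests() { return workersInNew.size(); }

    public int pendingRegistration() { return workersInLaunch.size(); }

    public int failedTasks() { return failedTasksSoFar; }

    public static void main(String[] args) {
        WorkerLifecycleSketch rm = new WorkerLifecycleSketch();
        rm.requestNew("taskmanager-00001");
        rm.launch("taskmanager-00001");
        rm.started("taskmanager-00001");
        rm.release("taskmanager-00001");
        System.out.println("planned release: " + rm.terminated("taskmanager-00001"));
    }
}
```

          In this sketch, as in the diff, only a termination that was not preceded by a release increments the failure counter, which is the value the real class compares against maxFailedTasks.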
          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75279529

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/MesosFlinkResourceManager.java —
          @@ -0,0 +1,755 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.runtime.clusterframework;
          +
          +import akka.actor.ActorRef;
          +import akka.actor.Props;
          +import com.netflix.fenzo.TaskRequest;
          +import com.netflix.fenzo.TaskScheduler;
          +import com.netflix.fenzo.VirtualMachineLease;
          +import com.netflix.fenzo.functions.Action1;
          +import org.apache.flink.api.java.tuple.Tuple2;
          +import org.apache.flink.configuration.ConfigConstants;
          +import org.apache.flink.configuration.Configuration;
          +import org.apache.flink.mesos.runtime.clusterframework.store.MesosWorkerStore;
          +import org.apache.flink.mesos.scheduler.ConnectionMonitor;
          +import org.apache.flink.mesos.scheduler.LaunchableTask;
          +import org.apache.flink.mesos.scheduler.LaunchCoordinator;
          +import org.apache.flink.mesos.scheduler.ReconciliationCoordinator;
          +import org.apache.flink.mesos.scheduler.SchedulerProxy;
          +import org.apache.flink.mesos.scheduler.TaskMonitor;
          +import org.apache.flink.mesos.scheduler.TaskSchedulerBuilder;
          +import org.apache.flink.mesos.scheduler.Tasks;
          +import org.apache.flink.mesos.scheduler.messages.AcceptOffers;
          +import org.apache.flink.mesos.scheduler.messages.Disconnected;
          +import org.apache.flink.mesos.scheduler.messages.Error;
          +import org.apache.flink.mesos.scheduler.messages.OfferRescinded;
          +import org.apache.flink.mesos.scheduler.messages.ReRegistered;
          +import org.apache.flink.mesos.scheduler.messages.Registered;
          +import org.apache.flink.mesos.scheduler.messages.ResourceOffers;
          +import org.apache.flink.mesos.scheduler.messages.StatusUpdate;
          +import org.apache.flink.mesos.util.MesosConfiguration;
          +import org.apache.flink.runtime.clusterframework.ApplicationStatus;
          +import org.apache.flink.runtime.clusterframework.FlinkResourceManager;
          +import org.apache.flink.runtime.clusterframework.messages.FatalErrorOccurred;
          +import org.apache.flink.runtime.clusterframework.messages.StopCluster;
          +import org.apache.flink.runtime.clusterframework.types.ResourceID;
          +import org.apache.flink.runtime.leaderretrieval.LeaderRetrievalService;
          +import org.apache.mesos.Protos;
          +import org.apache.mesos.Protos.FrameworkInfo;
          +import org.apache.mesos.SchedulerDriver;
          +import org.slf4j.Logger;
          +import scala.Option;
          +
          +import java.util.ArrayList;
          +import java.util.Collection;
          +import java.util.HashMap;
          +import java.util.List;
          +import java.util.Map;
          +
          +import static java.util.Objects.requireNonNull;
          +
          +/**
          + * Flink Resource Manager for Apache Mesos.
          + */
          +public class MesosFlinkResourceManager extends FlinkResourceManager<RegisteredMesosWorkerNode> {
          +
          + /** The Mesos configuration (master and framework info) */
          + private final MesosConfiguration mesosConfig;
          +
          + /** The TaskManager container parameters (like container memory size) */
          + private final MesosTaskManagerParameters taskManagerParameters;
          +
          + /** Context information used to start a TaskManager Java process */
          + private final Protos.TaskInfo.Builder taskManagerLaunchContext;
          +
          + /** Number of failed Mesos tasks before stopping the application. -1 means infinite. */
          + private final int maxFailedTasks;
          +
          + /** Callback handler for the asynchronous Mesos scheduler */
          + private SchedulerProxy schedulerCallbackHandler;
          +
          + /** Mesos scheduler driver */
          + private SchedulerDriver schedulerDriver;
          +
          + private ActorRef connectionMonitor;
          +
          + private ActorRef taskRouter;
          +
          + private ActorRef launchCoordinator;
          +
          + private ActorRef reconciliationCoordinator;
          +
          + private MesosWorkerStore workerStore;
          +
          + final Map<ResourceID, MesosWorkerStore.Worker> workersInNew;
          + final Map<ResourceID, MesosWorkerStore.Worker> workersInLaunch;
          + final Map<ResourceID, MesosWorkerStore.Worker> workersBeingReturned;
          +
          + /** The number of failed tasks since the master became active */
          + private int failedTasksSoFar;
          +
          + public MesosFlinkResourceManager(
          + Configuration flinkConfig,
          + MesosConfiguration mesosConfig,
          + MesosWorkerStore workerStore,
          + LeaderRetrievalService leaderRetrievalService,
          + MesosTaskManagerParameters taskManagerParameters,
          + Protos.TaskInfo.Builder taskManagerLaunchContext,
          + int maxFailedTasks,
          + int numInitialTaskManagers) {
          +
          + super(numInitialTaskManagers, flinkConfig, leaderRetrievalService);
          +
          + this.mesosConfig = requireNonNull(mesosConfig);
          +
          + this.workerStore = requireNonNull(workerStore);
          +
          + this.taskManagerParameters = requireNonNull(taskManagerParameters);
          + this.taskManagerLaunchContext = requireNonNull(taskManagerLaunchContext);
          + this.maxFailedTasks = maxFailedTasks;
          +
          + this.workersInNew = new HashMap<>();
          + this.workersInLaunch = new HashMap<>();
          + this.workersBeingReturned = new HashMap<>();
          + }
          +
          + // ------------------------------------------------------------------------
          + // Mesos-specific behavior
          + // ------------------------------------------------------------------------
          +
          + @Override
          + protected void initialize() throws Exception {
          + LOG.info("Initializing Mesos resource master");
          +
          + workerStore.start();
          +
          + // create the scheduler driver to communicate with Mesos
          + schedulerCallbackHandler = new SchedulerProxy(self());
          +
          + // register with Mesos
          + FrameworkInfo.Builder frameworkInfo = mesosConfig.frameworkInfo()
          + .clone()
          + .setCheckpoint(true);
          +
          + Option<Protos.FrameworkID> frameworkID = workerStore.getFrameworkID();
          + if(frameworkID.isEmpty()) {
          + LOG.info("Registering as new framework.");
          + }
          + else {
          + LOG.info("Recovery scenario: re-registering using framework ID {}.", frameworkID.get().getValue());
          + frameworkInfo.setId(frameworkID.get());
          + }
          +
          + MesosConfiguration initializedMesosConfig = mesosConfig.withFrameworkInfo(frameworkInfo);
          + MesosConfiguration.logMesosConfig(LOG, initializedMesosConfig);
          + schedulerDriver = initializedMesosConfig.createDriver(schedulerCallbackHandler, false);
          +
          + // create supporting actors
          + connectionMonitor = createConnectionMonitor();
          + launchCoordinator = createLaunchCoordinator();
          + reconciliationCoordinator = createReconciliationCoordinator();
          + taskRouter = createTaskRouter();
          +
          + recoverWorkers();
          +
          + connectionMonitor.tell(new ConnectionMonitor.Start(), self());
          + schedulerDriver.start();
          + }
          +
          + protected ActorRef createConnectionMonitor() {
          + return context().actorOf(
          + ConnectionMonitor.createActorProps(ConnectionMonitor.class, config),
          + "connectionMonitor");
          + }
          +
          + protected ActorRef createTaskRouter() {
          + return context().actorOf(
          + Tasks.createActorProps(Tasks.class, config, schedulerDriver, TaskMonitor.class),
          + "tasks");
          + }
          +
          + protected ActorRef createLaunchCoordinator() {
          + return context().actorOf(
          + LaunchCoordinator.createActorProps(LaunchCoordinator.class, self(), config, schedulerDriver, createOptimizer()),
          + "launchCoordinator");
          + }
          +
          + protected ActorRef createReconciliationCoordinator() {
          + return context().actorOf(
          + ReconciliationCoordinator.createActorProps(ReconciliationCoordinator.class, config, schedulerDriver),
          + "reconciliationCoordinator");
          + }
          +
          + @Override
          + public void postStop() {
          + LOG.info("Stopping Mesos resource master");
          + super.postStop();
          + }
          +
          + // ------------------------------------------------------------------------
          + // Actor messages
          + // ------------------------------------------------------------------------
          +
          + @Override
          + protected void handleMessage(Object message) {
          +
          + // check for Mesos-specific actor messages first
          +
          + // --- messages about Mesos connection
          + if (message instanceof Registered) {
          + registered((Registered) message);
          + } else if (message instanceof ReRegistered) {
          + reregistered((ReRegistered) message);
          + } else if (message instanceof Disconnected) {
          + disconnected((Disconnected) message);
          + } else if (message instanceof Error) {
          + error(((Error) message).message());
          +
          + // --- messages about offers
          + } else if (message instanceof ResourceOffers || message instanceof OfferRescinded) {
          + launchCoordinator.tell(message, self());
          + } else if (message instanceof AcceptOffers) {
          + acceptOffers((AcceptOffers) message);
          +
          + // --- messages about tasks
          + } else if (message instanceof StatusUpdate) {
          + taskStatusUpdated((StatusUpdate) message);
          + } else if (message instanceof ReconciliationCoordinator.Reconcile) {
          + // a reconciliation request from a task
          + reconciliationCoordinator.tell(message, self());
          + } else if (message instanceof TaskMonitor.TaskTerminated) {
          + // a termination message from a task
          + TaskMonitor.TaskTerminated msg = (TaskMonitor.TaskTerminated) message;
          + taskTerminated(msg.taskID(), msg.status());
          +
          + } else {
          + // message handled by the generic resource master code
          + super.handleMessage(message);
          + }
          + }
          +
          + /**
          + * Called to shut down the cluster (not a failover situation).
          + *
          + * @param finalStatus The application status to report.
          + * @param optionalDiagnostics An optional diagnostics message.
          + */
          + @Override
          + protected void shutdownApplication(ApplicationStatus finalStatus, String optionalDiagnostics) {
          +
          + LOG.info("Shutting down and unregistering as a Mesos framework.");
          + try {
          + // unregister the framework, which implicitly removes all tasks.
          + schedulerDriver.stop(false);
          + }
          + catch(Exception ex) {
          + LOG.warn("unable to unregister the framework", ex);
          + }
          +
          + try {
          + workerStore.cleanup();
          + }
          + catch(Exception ex) {
          + LOG.warn("unable to cleanup the ZooKeeper state", ex);
          + }
          +
          + context().stop(self());
          + }
          +
          + @Override
          + protected void fatalError(String message, Throwable error) {
          + // we do not unregister, but cause a hard fail of this process, to have it
          + // restarted by the dispatcher
          + LOG.error("FATAL ERROR IN MESOS APPLICATION MASTER: " + message, error);
          + LOG.error("Shutting down process");
          +
          + // kill this process, this will make an external supervisor (the dispatcher) restart the process
          + System.exit(EXIT_CODE_FATAL_ERROR);
          + }
          +
          + // ------------------------------------------------------------------------
          + // Worker Management
          + // ------------------------------------------------------------------------
          +
          + /**
          + * Recover framework/worker information persisted by a prior incarnation of the RM.
          + */
          + private void recoverWorkers() throws Exception {
          + // if this application master starts as part of an ApplicationMaster/JobManager recovery,
          + // then some worker tasks are most likely still alive and we can re-obtain them
          + final List<MesosWorkerStore.Worker> tasksFromPreviousAttempts = workerStore.recoverWorkers();
          +
          + if (!tasksFromPreviousAttempts.isEmpty()) {
          + LOG.info("Retrieved {} TaskManagers from previous attempt", tasksFromPreviousAttempts.size());
          +
          + List<Tuple2<TaskRequest,String>> toAssign = new ArrayList<>(tasksFromPreviousAttempts.size());
          + List<LaunchableTask> toLaunch = new ArrayList<>(tasksFromPreviousAttempts.size());
          +
          + for (final MesosWorkerStore.Worker worker : tasksFromPreviousAttempts) {
          + LaunchableMesosWorker launchable = createLaunchableMesosWorker(worker.taskID());
          +
          + switch(worker.state()) {
          + case New:
          + workersInNew.put(extractResourceID(worker.taskID()), worker);
          + toLaunch.add(launchable);
          + break;
          + case Launched:
          + workersInLaunch.put(extractResourceID(worker.taskID()), worker);
          + toAssign.add(new Tuple2<>(launchable.taskRequest(), worker.hostname().get()));
          + break;
          + case Released:
          + workersBeingReturned.put(extractResourceID(worker.taskID()), worker);
          + break;
          + }
          + taskRouter.tell(new TaskMonitor.TaskGoalStateUpdated(extractGoalState(worker)), self());
          + }
          +
          + // tell the launch coordinator about prior assignments
          + if(toAssign.size() >= 1) {
          + launchCoordinator.tell(new LaunchCoordinator.Assign(toAssign), self());
          + }
          + // tell the launch coordinator to launch any new tasks
          + if(toLaunch.size() >= 1) {
          + launchCoordinator.tell(new LaunchCoordinator.Launch(toLaunch), self());
          + }
          + }
          + }
          +
          + /**
          + * Plan for some additional workers to be launched.
          + *
          + * @param numWorkers The number of workers to allocate.
          + */
          + @Override
          + protected void requestNewWorkers(int numWorkers) {
          +
          + try {
          + List<TaskMonitor.TaskGoalStateUpdated> toMonitor = new ArrayList<>(numWorkers);
          + List<LaunchableTask> toLaunch = new ArrayList<>(numWorkers);
          +
          + // generate new workers into persistent state and launch associated actors
          + for (int i = 0; i < numWorkers; i++) {
          + MesosWorkerStore.Worker worker = MesosWorkerStore.Worker.newTask(workerStore.newTaskID());
          + workerStore.putWorker(worker);
          + workersInNew.put(extractResourceID(worker.taskID()), worker);
          +
          + LaunchableMesosWorker launchable = createLaunchableMesosWorker(worker.taskID());
          +
          + LOG.info("Scheduling Mesos task {} with ({} MB, {} cpus).",
          + launchable.taskID().getValue(), launchable.taskRequest().getMemory(), launchable.taskRequest().getCPUs());
          +
          + toMonitor.add(new TaskMonitor.TaskGoalStateUpdated(extractGoalState(worker)));
          + toLaunch.add(launchable);
          + }
          +
          + // tell the task router about the new plans
          + for (TaskMonitor.TaskGoalStateUpdated update : toMonitor) {
          + taskRouter.tell(update, self());
          + }
          +
          + // tell the launch coordinator to launch the new tasks
          + if(toLaunch.size() >= 1) {
          + launchCoordinator.tell(new LaunchCoordinator.Launch(toLaunch), self());
          + }
          + }
          + catch(Exception ex) {
          + fatalError("unable to request new workers", ex);
          + }
          + }
          +
          + /**
          + * Accept offers as advised by the launch coordinator.
          + *
          + * Acceptance is routed through the RM to update the persistent state before
          + * forwarding the message to Mesos.
          + */
          + private void acceptOffers(AcceptOffers msg) {
          +
          + try {
          + List<TaskMonitor.TaskGoalStateUpdated> toMonitor = new ArrayList<>(msg.operations().size());
          +
          + // transition the persistent state of some tasks to Launched
          + for (Protos.Offer.Operation op : msg.operations()) {
          + if (op.getType() != Protos.Offer.Operation.Type.LAUNCH) {
          + continue;
          + }
          + for (Protos.TaskInfo info : op.getLaunch().getTaskInfosList()) {
          + MesosWorkerStore.Worker worker = workersInNew.remove(extractResourceID(info.getTaskId()));
          + assert (worker != null);
          +
          + worker = worker.launchTask(info.getSlaveId(), msg.hostname());
          + workerStore.putWorker(worker);
          + workersInLaunch.put(extractResourceID(worker.taskID()), worker);
          +
          + LOG.info("Launching Mesos task {} on host {}.",
          + worker.taskID().getValue(), worker.hostname().get());
          +
          + toMonitor.add(new TaskMonitor.TaskGoalStateUpdated(extractGoalState(worker)));
          + }
          + }
          +
          + // tell the task router about the new plans
          + for (TaskMonitor.TaskGoalStateUpdated update : toMonitor) {
          + taskRouter.tell(update, self());
          + }
          +
          + // send the acceptance message to Mesos
          + schedulerDriver.acceptOffers(msg.offerIds(), msg.operations(), msg.filters());
          + }
          + catch(Exception ex) {
          + fatalError("unable to accept offers", ex);
          + }
          + }
          +
          + /**
          + * Handle a task status change.
          + */
          + private void taskStatusUpdated(StatusUpdate message) {
          + taskRouter.tell(message, self());
          + reconciliationCoordinator.tell(message, self());
          + schedulerDriver.acknowledgeStatusUpdate(message.status());
          + }
          +
          + /**
          + * Accept the given started worker into the internal state.
          + *
          + * @param resourceID The worker resource id
          + * @return A registered worker node record.
          + */
          + @Override
          + protected RegisteredMesosWorkerNode workerStarted(ResourceID resourceID) {
          + MesosWorkerStore.Worker inLaunch = workersInLaunch.remove(resourceID);
          + if (inLaunch == null) {
          + // Worker was not in state "being launched", this can indicate that the TaskManager
          + // in this worker was already registered or that the container was not started
          + // by this resource manager. Simply ignore this resourceID.
          + return null;
          + }
          + return new RegisteredMesosWorkerNode(inLaunch);
          + }
          +
          + /**
          + * Accept the given registered workers into the internal state.
          + *
          + * @param toConsolidate The worker IDs known previously to the JobManager.
          + * @return A collection of registered worker node records.
          + */
          + @Override
          + protected Collection<RegisteredMesosWorkerNode> reacceptRegisteredWorkers(Collection<ResourceID> toConsolidate) {
          +
          + // we check for each task manager if we recognize its Mesos task ID
          + List<RegisteredMesosWorkerNode> accepted = new ArrayList<>(toConsolidate.size());
          + for (ResourceID resourceID : toConsolidate) {
          + MesosWorkerStore.Worker worker = workersInLaunch.remove(resourceID);
          + if (worker != null) {
          + LOG.info("Mesos worker consolidation recognizes TaskManager {}.", resourceID);
          + accepted.add(new RegisteredMesosWorkerNode(worker));
          + }
          + else {
          + if(isStarted(resourceID)) {
          + LOG.info("TaskManager {} has already been registered at the resource manager.", resourceID);
          + }
          + else {
          + LOG.info("Mesos worker consolidation does not recognize TaskManager {}.", resourceID);
          + }
          + }
          + }
          + return accepted;
          + }
          +
          + /**
          + * Release the given pending worker.
          + */
          + @Override
          + protected void releasePendingWorker(ResourceID id) {
          + MesosWorkerStore.Worker worker = workersInLaunch.remove(id);
          + if (worker != null) {
          + releaseWorker(worker);
          + } else {
          + LOG.error("Cannot find worker {} to release. Ignoring request.", id);
          + }
          + }
          +
          + /**
          + * Release the given started worker.
          + */
          + @Override
          + protected void releaseStartedWorker(RegisteredMesosWorkerNode worker) {
          + releaseWorker(worker.task());
          + }
          +
          + /**
          + * Plan for the removal of the given worker.
          + */
          + private void releaseWorker(MesosWorkerStore.Worker worker) {
          + try {
          + LOG.info("Releasing worker {}", worker.taskID());
          +
          + // update persistent state of worker to Released
          + worker = worker.releaseTask();
          + workerStore.putWorker(worker);
          + workersBeingReturned.put(extractResourceID(worker.taskID()), worker);
          + taskRouter.tell(new TaskMonitor.TaskGoalStateUpdated(extractGoalState(worker)), self());
          +
          + if (worker.hostname().isDefined()) {
          + // tell the launch coordinator that the task is being unassigned from the host, for planning purposes
          + launchCoordinator.tell(new LaunchCoordinator.Unassign(worker.taskID(), worker.hostname().get()), self());
          + }
          + }
          + catch (Exception ex) {
          + fatalError("unable to release worker", ex);
          + }
          + }
          +
          + @Override
          + protected int getNumWorkerRequestsPending() {
          + return workersInNew.size();
          + }
          +
          + @Override
          + protected int getNumWorkersPendingRegistration() {
          + return workersInLaunch.size();
          + }
          +
          + // ------------------------------------------------------------------------
          + // Callbacks from the Mesos Master
          + // ------------------------------------------------------------------------
          +
          + /**
          + * Called when connected to Mesos as a new framework.
          + */
          + private void registered(Registered message) {
          + connectionMonitor.tell(message, self());
          +
          + try {
          + workerStore.setFrameworkID(Option.apply(message.frameworkId()));
          + }
          + catch(Exception ex) {
          + fatalError("unable to store the assigned framework ID", ex);
          + return;
          + }
          +
          + launchCoordinator.tell(message, self());
          + reconciliationCoordinator.tell(message, self());
          + taskRouter.tell(message, self());
          + }
          +
          + /**
          + * Called when reconnected to Mesos following a failover event.
          + */
          + private void reregistered(ReRegistered message) {
          + connectionMonitor.tell(message, self());
          + launchCoordinator.tell(message, self());
          + reconciliationCoordinator.tell(message, self());
          + taskRouter.tell(message, self());
          + }
          +
          + /**
          + * Called when disconnected from Mesos.
          + */
          + private void disconnected(Disconnected message) {
          + connectionMonitor.tell(message, self());
          + launchCoordinator.tell(message, self());
          + reconciliationCoordinator.tell(message, self());
          + taskRouter.tell(message, self());
          + }
          +
          + /**
          + * Called when an error is reported by the scheduler callback.
          + */
          + private void error(String message) {
          + self().tell(new FatalErrorOccurred("Connection to Mesos failed", new Exception(message)), self());
          + }
          +
          + /**
          + * Invoked when a Mesos task reaches a terminal status.
          + */
          + private void taskTerminated(Protos.TaskID taskID, Protos.TaskStatus status) {
          + // this callback occurs for failed containers and for released containers alike
          +
          + final ResourceID id = extractResourceID(taskID);
          +
          + try {
          + workerStore.removeWorker(taskID);
          + }
          + catch(Exception ex) {
          + fatalError("unable to remove worker", ex);
          + return;
          + }
          +
          + // check if this is a failed task or a released task
          + if (workersBeingReturned.remove(id) != null) {
          + // regular finished worker that we released
          + LOG.info("Worker {} finished successfully with diagnostics: {}",
          + id, status.getMessage());
          + } else {
          + // failed worker, either at startup, or running
          + final MesosWorkerStore.Worker launched = workersInLaunch.remove(id);
          + if (launched != null) {
          + LOG.info("Mesos task {} failed, with a TaskManager in launch or registration. " +
          + "State: {} Reason: {} ({})", id, status.getState(), status.getReason(), status.getMessage());
          + // we will trigger re-acquiring new workers at the end
          + } else {
          + // failed registered worker
          + LOG.info("Mesos task {} failed, with a registered TaskManager. " +
          + "State: {} Reason: {} ({})", id, status.getState(), status.getReason(), status.getMessage());
          +
          + // notify the generic logic, which notifies the JobManager, etc.
          + notifyWorkerFailed(id, "Mesos task " + id + " failed. State: " + status.getState());
          + }
          +
          + // general failure logging
          + failedTasksSoFar++;
          +
          + String diagMessage = String.format("Diagnostics for task %s in state %s : " +
          + "reason=%s message=%s",
          + id, status.getState(), status.getReason(), status.getMessage());
          + sendInfoMessage(diagMessage);
          +
          + LOG.info(diagMessage);
          + LOG.info("Total number of failed tasks so far: " + failedTasksSoFar);
          +
          + // maxFailedTasks == -1 is infinite number of retries.
          + if (maxFailedTasks >= 0 && failedTasksSoFar > maxFailedTasks) {
          + String msg = "Stopping Mesos session because the number of failed tasks ("
          + + failedTasksSoFar + ") exceeded the maximum failed tasks ("
          + + maxFailedTasks + "). This number is controlled by the '"
          + + ConfigConstants.MESOS_MAX_FAILED_TASKS + "' configuration setting. "
          + + "By default its the number of requested tasks.";
          +
          + LOG.error(msg);
          + self().tell(decorateMessage(new StopCluster(ApplicationStatus.FAILED, msg)),
          + ActorRef.noSender());
          +
          + // no need to do anything else
          + return;
          + }
          + }
          +
          + // in case failed containers were among the finished containers, make
          + // sure we re-examine and request new ones
          + triggerCheckWorkers();
          + }
          +
          + // ------------------------------------------------------------------------
          + // Utilities
          + // ------------------------------------------------------------------------
          +
          + private LaunchableMesosWorker createLaunchableMesosWorker(Protos.TaskID taskID) {
          + LaunchableMesosWorker launchable =
          + new LaunchableMesosWorker(taskManagerParameters, taskManagerLaunchContext, taskID);
          + return launchable;
          — End diff –

          Could be just `return new LaunchableMesosWorker(taskManagerParameters, taskManagerLaunchContext, taskID);`.
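
          For illustration only, a minimal sketch of the two forms side by side. `Launchable`, `PARAMS` and `CONTEXT` are hypothetical stand-ins, not the real Flink classes or fields; the point is just that both methods return equivalent objects, so the temporary variable adds nothing.

```java
// Sketch only: Launchable stands in for LaunchableMesosWorker, and PARAMS / CONTEXT
// stand in for taskManagerParameters / taskManagerLaunchContext (all hypothetical).
public class InlineReturnSketch {

    /** Hypothetical stand-in for the worker type built by the factory method. */
    static final class Launchable {
        final String params;
        final String launchContext;
        final String taskId;

        Launchable(String params, String launchContext, String taskId) {
            this.params = params;
            this.launchContext = launchContext;
            this.taskId = taskId;
        }
    }

    private static final String PARAMS = "params";
    private static final String CONTEXT = "context";

    // Form used in the diff: assign to a local, then return it immediately.
    static Launchable withTemporary(String taskId) {
        Launchable launchable = new Launchable(PARAMS, CONTEXT, taskId);
        return launchable;
    }

    // Form suggested in the review: return the new object directly.
    static Launchable direct(String taskId) {
        return new Launchable(PARAMS, CONTEXT, taskId);
    }

    public static void main(String[] args) {
        Launchable a = withTemporary("task-1");
        Launchable b = direct("task-1");
        // Both forms build an object with the same field values.
        System.out.println(a.taskId.equals(b.taskId) && a.params.equals(b.params));
    }
}
```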

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75276925

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/MesosFlinkResourceManager.java —
          @@ -0,0 +1,755 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.runtime.clusterframework;
          +
          +import akka.actor.ActorRef;
          +import akka.actor.Props;
          +import com.netflix.fenzo.TaskRequest;
          +import com.netflix.fenzo.TaskScheduler;
          +import com.netflix.fenzo.VirtualMachineLease;
          +import com.netflix.fenzo.functions.Action1;
          +import org.apache.flink.api.java.tuple.Tuple2;
          +import org.apache.flink.configuration.ConfigConstants;
          +import org.apache.flink.configuration.Configuration;
          +import org.apache.flink.mesos.runtime.clusterframework.store.MesosWorkerStore;
          +import org.apache.flink.mesos.scheduler.ConnectionMonitor;
          +import org.apache.flink.mesos.scheduler.LaunchableTask;
          +import org.apache.flink.mesos.scheduler.LaunchCoordinator;
          +import org.apache.flink.mesos.scheduler.ReconciliationCoordinator;
          +import org.apache.flink.mesos.scheduler.SchedulerProxy;
          +import org.apache.flink.mesos.scheduler.TaskMonitor;
          +import org.apache.flink.mesos.scheduler.TaskSchedulerBuilder;
          +import org.apache.flink.mesos.scheduler.Tasks;
          +import org.apache.flink.mesos.scheduler.messages.AcceptOffers;
          +import org.apache.flink.mesos.scheduler.messages.Disconnected;
          +import org.apache.flink.mesos.scheduler.messages.Error;
          +import org.apache.flink.mesos.scheduler.messages.OfferRescinded;
          +import org.apache.flink.mesos.scheduler.messages.ReRegistered;
          +import org.apache.flink.mesos.scheduler.messages.Registered;
          +import org.apache.flink.mesos.scheduler.messages.ResourceOffers;
          +import org.apache.flink.mesos.scheduler.messages.StatusUpdate;
          +import org.apache.flink.mesos.util.MesosConfiguration;
          +import org.apache.flink.runtime.clusterframework.ApplicationStatus;
          +import org.apache.flink.runtime.clusterframework.FlinkResourceManager;
          +import org.apache.flink.runtime.clusterframework.messages.FatalErrorOccurred;
          +import org.apache.flink.runtime.clusterframework.messages.StopCluster;
          +import org.apache.flink.runtime.clusterframework.types.ResourceID;
          +import org.apache.flink.runtime.leaderretrieval.LeaderRetrievalService;
          +import org.apache.mesos.Protos;
          +import org.apache.mesos.Protos.FrameworkInfo;
          +import org.apache.mesos.SchedulerDriver;
          +import org.slf4j.Logger;
          +import scala.Option;
          +
          +import java.util.ArrayList;
          +import java.util.Collection;
          +import java.util.HashMap;
          +import java.util.List;
          +import java.util.Map;
          +
          +import static java.util.Objects.requireNonNull;
          +
          +/**
          + * Flink Resource Manager for Apache Mesos.
          + */
          +public class MesosFlinkResourceManager extends FlinkResourceManager<RegisteredMesosWorkerNode> {
          +
          + /** The Mesos configuration (master and framework info) */
          + private final MesosConfiguration mesosConfig;
          +
          + /** The TaskManager container parameters (like container memory size) */
          + private final MesosTaskManagerParameters taskManagerParameters;
          +
          + /** Context information used to start a TaskManager Java process */
          + private final Protos.TaskInfo.Builder taskManagerLaunchContext;
          +
          + /** Number of failed Mesos tasks before stopping the application. -1 means infinite. */
          + private final int maxFailedTasks;
          +
          + /** Callback handler for the asynchronous Mesos scheduler */
          + private SchedulerProxy schedulerCallbackHandler;
          +
          + /** Mesos scheduler driver */
          + private SchedulerDriver schedulerDriver;
          +
          + private ActorRef connectionMonitor;
          +
          + private ActorRef taskRouter;
          +
          + private ActorRef launchCoordinator;
          +
          + private ActorRef reconciliationCoordinator;
          +
          + private MesosWorkerStore workerStore;
          +
          + final Map<ResourceID, MesosWorkerStore.Worker> workersInNew;
          + final Map<ResourceID, MesosWorkerStore.Worker> workersInLaunch;
          + final Map<ResourceID, MesosWorkerStore.Worker> workersBeingReturned;
          +
          + /** The number of failed tasks since the master became active */
          + private int failedTasksSoFar;
          +
          + public MesosFlinkResourceManager(
          + Configuration flinkConfig,
          + MesosConfiguration mesosConfig,
          + MesosWorkerStore workerStore,
          + LeaderRetrievalService leaderRetrievalService,
          + MesosTaskManagerParameters taskManagerParameters,
          + Protos.TaskInfo.Builder taskManagerLaunchContext,
          + int maxFailedTasks,
          + int numInitialTaskManagers) {
          +
          + super(numInitialTaskManagers, flinkConfig, leaderRetrievalService);
          +
          + this.mesosConfig = requireNonNull(mesosConfig);
          +
          + this.workerStore = requireNonNull(workerStore);
          +
          + this.taskManagerParameters = requireNonNull(taskManagerParameters);
          + this.taskManagerLaunchContext = requireNonNull(taskManagerLaunchContext);
          + this.maxFailedTasks = maxFailedTasks;
          +
          + this.workersInNew = new HashMap<>();
          + this.workersInLaunch = new HashMap<>();
          + this.workersBeingReturned = new HashMap<>();
          + }
          +
          + // ------------------------------------------------------------------------
          + // Mesos-specific behavior
          + // ------------------------------------------------------------------------
          +
          + @Override
          + protected void initialize() throws Exception {
          + LOG.info("Initializing Mesos resource master");
          +
          + workerStore.start();
          +
          + // create the scheduler driver to communicate with Mesos
          + schedulerCallbackHandler = new SchedulerProxy(self());
          +
          + // register with Mesos
          + FrameworkInfo.Builder frameworkInfo = mesosConfig.frameworkInfo()
          + .clone()
          + .setCheckpoint(true);
          +
          + Option<Protos.FrameworkID> frameworkID = workerStore.getFrameworkID();
          + if(frameworkID.isEmpty()) {
          + LOG.info("Registering as new framework.");
          + }
          + else {
          + LOG.info("Recovery scenario: re-registering using framework ID {}.", frameworkID.get().getValue());
          + frameworkInfo.setId(frameworkID.get());
          + }
          +
          + MesosConfiguration initializedMesosConfig = mesosConfig.withFrameworkInfo(frameworkInfo);
          + MesosConfiguration.logMesosConfig(LOG, initializedMesosConfig);
          + schedulerDriver = initializedMesosConfig.createDriver(schedulerCallbackHandler, false);
          +
          + // create supporting actors
          + connectionMonitor = createConnectionMonitor();
          + launchCoordinator = createLaunchCoordinator();
          + reconciliationCoordinator = createReconciliationCoordinator();
          + taskRouter = createTaskRouter();
          +
          + recoverWorkers();
          +
          + connectionMonitor.tell(new ConnectionMonitor.Start(), self());
          + schedulerDriver.start();
          + }
          +
          + protected ActorRef createConnectionMonitor() {
          + return context().actorOf(
          + ConnectionMonitor.createActorProps(ConnectionMonitor.class, config),
          + "connectionMonitor");
          + }
          +
          + protected ActorRef createTaskRouter() {
          + return context().actorOf(
          + Tasks.createActorProps(Tasks.class, config, schedulerDriver, TaskMonitor.class),
          + "tasks");
          + }
          +
          + protected ActorRef createLaunchCoordinator() {
          + return context().actorOf(
          + LaunchCoordinator.createActorProps(LaunchCoordinator.class, self(), config, schedulerDriver, createOptimizer()),
          + "launchCoordinator");
          + }
          +
          + protected ActorRef createReconciliationCoordinator() {
          + return context().actorOf(
          + ReconciliationCoordinator.createActorProps(ReconciliationCoordinator.class, config, schedulerDriver),
          + "reconciliationCoordinator");
          + }
          +
          + @Override
          + public void postStop() {
          + LOG.info("Stopping Mesos resource master");
          + super.postStop();
          + }
          +
          + // ------------------------------------------------------------------------
          + // Actor messages
          + // ------------------------------------------------------------------------
          +
          + @Override
          + protected void handleMessage(Object message) {
          +
          + // check for Mesos-specific actor messages first
          +
          + // --- messages about Mesos connection
          + if (message instanceof Registered) {
          + registered((Registered) message);
          + } else if (message instanceof ReRegistered) {
          + reregistered((ReRegistered) message);
          + } else if (message instanceof Disconnected) {
          + disconnected((Disconnected) message);
          + } else if (message instanceof Error) {
          + error(((Error) message).message());
          +
          + // --- messages about offers
          + } else if (message instanceof ResourceOffers || message instanceof OfferRescinded) {
          + launchCoordinator.tell(message, self());
          + } else if (message instanceof AcceptOffers) {
          + acceptOffers((AcceptOffers) message);
          +
          + // --- messages about tasks
          + } else if (message instanceof StatusUpdate) {
          + taskStatusUpdated((StatusUpdate) message);
          + } else if (message instanceof ReconciliationCoordinator.Reconcile) {
          + // a reconciliation request from a task
          + reconciliationCoordinator.tell(message, self());
          + } else if (message instanceof TaskMonitor.TaskTerminated) {
          + // a termination message from a task
          + TaskMonitor.TaskTerminated msg = (TaskMonitor.TaskTerminated) message;
          + taskTerminated(msg.taskID(), msg.status());
          +
          + } else {
          + // message handled by the generic resource master code
          + super.handleMessage(message);
          + }
          + }
          +
          + /**
          + * Called to shut down the cluster (not a failover situation).
          + *
          + * @param finalStatus The application status to report.
          + * @param optionalDiagnostics An optional diagnostics message.
          + */
          — End diff –

          Spacing. I will refrain from any further spacing comments.

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75276016

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/MesosFlinkResourceManager.java —
          @@ -0,0 +1,755 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.runtime.clusterframework;
          +
          +import akka.actor.ActorRef;
          +import akka.actor.Props;
          +import com.netflix.fenzo.TaskRequest;
          +import com.netflix.fenzo.TaskScheduler;
          +import com.netflix.fenzo.VirtualMachineLease;
          +import com.netflix.fenzo.functions.Action1;
          +import org.apache.flink.api.java.tuple.Tuple2;
          +import org.apache.flink.configuration.ConfigConstants;
          +import org.apache.flink.configuration.Configuration;
          +import org.apache.flink.mesos.runtime.clusterframework.store.MesosWorkerStore;
          +import org.apache.flink.mesos.scheduler.ConnectionMonitor;
          +import org.apache.flink.mesos.scheduler.LaunchableTask;
          +import org.apache.flink.mesos.scheduler.LaunchCoordinator;
          +import org.apache.flink.mesos.scheduler.ReconciliationCoordinator;
          +import org.apache.flink.mesos.scheduler.SchedulerProxy;
          +import org.apache.flink.mesos.scheduler.TaskMonitor;
          +import org.apache.flink.mesos.scheduler.TaskSchedulerBuilder;
          +import org.apache.flink.mesos.scheduler.Tasks;
          +import org.apache.flink.mesos.scheduler.messages.AcceptOffers;
          +import org.apache.flink.mesos.scheduler.messages.Disconnected;
          +import org.apache.flink.mesos.scheduler.messages.Error;
          +import org.apache.flink.mesos.scheduler.messages.OfferRescinded;
          +import org.apache.flink.mesos.scheduler.messages.ReRegistered;
          +import org.apache.flink.mesos.scheduler.messages.Registered;
          +import org.apache.flink.mesos.scheduler.messages.ResourceOffers;
          +import org.apache.flink.mesos.scheduler.messages.StatusUpdate;
          +import org.apache.flink.mesos.util.MesosConfiguration;
          +import org.apache.flink.runtime.clusterframework.ApplicationStatus;
          +import org.apache.flink.runtime.clusterframework.FlinkResourceManager;
          +import org.apache.flink.runtime.clusterframework.messages.FatalErrorOccurred;
          +import org.apache.flink.runtime.clusterframework.messages.StopCluster;
          +import org.apache.flink.runtime.clusterframework.types.ResourceID;
          +import org.apache.flink.runtime.leaderretrieval.LeaderRetrievalService;
          +import org.apache.mesos.Protos;
          +import org.apache.mesos.Protos.FrameworkInfo;
          +import org.apache.mesos.SchedulerDriver;
          +import org.slf4j.Logger;
          +import scala.Option;
          +
          +import java.util.ArrayList;
          +import java.util.Collection;
          +import java.util.HashMap;
          +import java.util.List;
          +import java.util.Map;
          +
          +import static java.util.Objects.requireNonNull;
          +
          +/**
          + * Flink Resource Manager for Apache Mesos.
          + */
          +public class MesosFlinkResourceManager extends FlinkResourceManager<RegisteredMesosWorkerNode> {
          +
          + /** The Mesos configuration (master and framework info) */
          + private final MesosConfiguration mesosConfig;
          +
          + /** The TaskManager container parameters (like container memory size) */
          + private final MesosTaskManagerParameters taskManagerParameters;
          +
          + /** Context information used to start a TaskManager Java process */
          + private final Protos.TaskInfo.Builder taskManagerLaunchContext;
          +
          + /** Number of failed Mesos tasks before stopping the application. -1 means infinite. */
          + private final int maxFailedTasks;
          +
          + /** Callback handler for the asynchronous Mesos scheduler */
          + private SchedulerProxy schedulerCallbackHandler;
          +
          + /** Mesos scheduler driver */
          + private SchedulerDriver schedulerDriver;
          +
          + private ActorRef connectionMonitor;
          +
          + private ActorRef taskRouter;
          +
          + private ActorRef launchCoordinator;
          +
          + private ActorRef reconciliationCoordinator;
          +
          + private MesosWorkerStore workerStore;
          +
          + final Map<ResourceID, MesosWorkerStore.Worker> workersInNew;
          + final Map<ResourceID, MesosWorkerStore.Worker> workersInLaunch;
          + final Map<ResourceID, MesosWorkerStore.Worker> workersBeingReturned;
          — End diff –

          I think for testing purposes it's fine. It would be a different matter for a public or user-facing API.

          Show
          githubbot ASF GitHub Bot added a comment - Github user mxm commented on a diff in the pull request: https://github.com/apache/flink/pull/2315#discussion_r75276016 — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/MesosFlinkResourceManager.java — @@ -0,0 +1,755 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.flink.mesos.runtime.clusterframework; + +import akka.actor.ActorRef; +import akka.actor.Props; +import com.netflix.fenzo.TaskRequest; +import com.netflix.fenzo.TaskScheduler; +import com.netflix.fenzo.VirtualMachineLease; +import com.netflix.fenzo.functions.Action1; +import org.apache.flink.api.java.tuple.Tuple2; +import org.apache.flink.configuration.ConfigConstants; +import org.apache.flink.configuration.Configuration; +import org.apache.flink.mesos.runtime.clusterframework.store.MesosWorkerStore; +import org.apache.flink.mesos.scheduler.ConnectionMonitor; +import org.apache.flink.mesos.scheduler.LaunchableTask; +import org.apache.flink.mesos.scheduler.LaunchCoordinator; +import org.apache.flink.mesos.scheduler.ReconciliationCoordinator; +import org.apache.flink.mesos.scheduler.SchedulerProxy; +import org.apache.flink.mesos.scheduler.TaskMonitor; +import org.apache.flink.mesos.scheduler.TaskSchedulerBuilder; +import org.apache.flink.mesos.scheduler.Tasks; +import org.apache.flink.mesos.scheduler.messages.AcceptOffers; +import org.apache.flink.mesos.scheduler.messages.Disconnected; +import org.apache.flink.mesos.scheduler.messages.Error; +import org.apache.flink.mesos.scheduler.messages.OfferRescinded; +import org.apache.flink.mesos.scheduler.messages.ReRegistered; +import org.apache.flink.mesos.scheduler.messages.Registered; +import org.apache.flink.mesos.scheduler.messages.ResourceOffers; +import org.apache.flink.mesos.scheduler.messages.StatusUpdate; +import org.apache.flink.mesos.util.MesosConfiguration; +import org.apache.flink.runtime.clusterframework.ApplicationStatus; +import org.apache.flink.runtime.clusterframework.FlinkResourceManager; +import org.apache.flink.runtime.clusterframework.messages.FatalErrorOccurred; +import org.apache.flink.runtime.clusterframework.messages.StopCluster; +import org.apache.flink.runtime.clusterframework.types.ResourceID; +import org.apache.flink.runtime.leaderretrieval.LeaderRetrievalService; 
+import org.apache.mesos.Protos; +import org.apache.mesos.Protos.FrameworkInfo; +import org.apache.mesos.SchedulerDriver; +import org.slf4j.Logger; +import scala.Option; + +import java.util.ArrayList; +import java.util.Collection; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +import static java.util.Objects.requireNonNull; + +/** + * Flink Resource Manager for Apache Mesos. + */ +public class MesosFlinkResourceManager extends FlinkResourceManager<RegisteredMesosWorkerNode> { + + /** The Mesos configuration (master and framework info) */ + private final MesosConfiguration mesosConfig; + + /** The TaskManager container parameters (like container memory size) */ + private final MesosTaskManagerParameters taskManagerParameters; + + /** Context information used to start a TaskManager Java process */ + private final Protos.TaskInfo.Builder taskManagerLaunchContext; + + /** Number of failed Mesos tasks before stopping the application. -1 means infinite. */ + private final int maxFailedTasks; + + /** Callback handler for the asynchronous Mesos scheduler */ + private SchedulerProxy schedulerCallbackHandler; + + /** Mesos scheduler driver */ + private SchedulerDriver schedulerDriver; + + private ActorRef connectionMonitor; + + private ActorRef taskRouter; + + private ActorRef launchCoordinator; + + private ActorRef reconciliationCoordinator; + + private MesosWorkerStore workerStore; + + final Map<ResourceID, MesosWorkerStore.Worker> workersInNew; + final Map<ResourceID, MesosWorkerStore.Worker> workersInLaunch; + final Map<ResourceID, MesosWorkerStore.Worker> workersBeingReturned; — End diff – I think for testing purposes it's fine. This is a different matter for public or user-facing API.
          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75275735

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/MesosConfigKeys.java —
          @@ -0,0 +1,44 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.runtime.clusterframework;
          +
          +/**
          + * The Mesos environment variables used for settings of the containers.
          + */
          +public class MesosConfigKeys {
          — End diff –

          I wonder, would it make sense to create `ContainerEnvConfigKeys` with the shared environment variables in `YarnConfigKeys` and `MesosConfigKeys`? The overlap is quite significant.
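
          A minimal sketch of what such a shared class could look like. The `ENV_*` constant names are the ones referenced in this diff; the string values, the inheritance layout, and `ENV_APP_ID` are assumptions for illustration only, not the actual Flink definitions.

          ```java
          // Hypothetical layout: shared container environment keys live in one base class,
          // and the YARN and Mesos key classes only add what is specific to them.
          // All string values below are placeholders, not the values Flink uses.
          class ContainerEnvConfigKeys {
              /** Memory of a TaskManager container, in MB. */
              static final String ENV_TM_MEMORY = "_CLIENT_TM_MEMORY";
              /** Number of TaskManagers to start initially. */
              static final String ENV_TM_COUNT = "_CLIENT_TM_COUNT";
              /** Number of task slots per TaskManager. */
              static final String ENV_SLOTS = "_SLOTS";
              /** Encoded dynamic properties passed from the client. */
              static final String ENV_DYNAMIC_PROPERTIES = "_DYNAMIC_PROPERTIES";

              /** Not meant to be instantiated; protected so subclasses can declare their own constructor. */
              protected ContainerEnvConfigKeys() {}
          }

          /** Mesos-specific keys on top of the shared ones. */
          final class MesosConfigKeys extends ContainerEnvConfigKeys {
              static final String ENV_MESOS_SANDBOX = "MESOS_SANDBOX";
              static final String ENV_SESSION_ID = "_CLIENT_SESSION_ID";

              private MesosConfigKeys() {}
          }

          /** YARN-specific keys on top of the shared ones (ENV_APP_ID is illustrative). */
          final class YarnConfigKeys extends ContainerEnvConfigKeys {
              static final String ENV_APP_ID = "_APP_ID";

              private YarnConfigKeys() {}
          }
          ```

          With this layout, existing call sites such as `ENV.get(MesosConfigKeys.ENV_TM_MEMORY)` would keep compiling, because static fields are inherited from the base class.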

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75275043

          — Diff: flink-mesos/pom.xml —
          @@ -0,0 +1,294 @@
          +<!--
          +Licensed to the Apache Software Foundation (ASF) under one
          +or more contributor license agreements. See the NOTICE file
          +distributed with this work for additional information
          +regarding copyright ownership. The ASF licenses this file
          +to you under the Apache License, Version 2.0 (the
          +"License"); you may not use this file except in compliance
          +with the License. You may obtain a copy of the License at
          +
          + http://www.apache.org/licenses/LICENSE-2.0
          +
          +Unless required by applicable law or agreed to in writing,
          +software distributed under the License is distributed on an
          +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
          +KIND, either express or implied. See the License for the
          +specific language governing permissions and limitations
          +under the License.
          +-->
          +<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          + xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
          + <modelVersion>4.0.0</modelVersion>
          +
          + <parent>
          + <groupId>org.apache.flink</groupId>
          + <artifactId>flink-parent</artifactId>
          + <version>1.1-SNAPSHOT</version>
          — End diff –

          This needs to be bumped to `1.2-SNAPSHOT`.

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75274022

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/MesosApplicationMasterRunner.java —
          @@ -0,0 +1,618 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.runtime.clusterframework;
          +
          +import akka.actor.ActorRef;
          +import akka.actor.ActorSystem;
          +import akka.actor.Props;
          +
          +import org.apache.curator.framework.CuratorFramework;
          +import org.apache.flink.configuration.ConfigConstants;
          +import org.apache.flink.configuration.Configuration;
          +import org.apache.flink.configuration.GlobalConfiguration;
          +import org.apache.flink.configuration.IllegalConfigurationException;
          +import org.apache.flink.mesos.cli.FlinkMesosSessionCli;
          +import org.apache.flink.mesos.runtime.clusterframework.store.MesosWorkerStore;
          +import org.apache.flink.mesos.runtime.clusterframework.store.StandaloneMesosWorkerStore;
          +import org.apache.flink.mesos.runtime.clusterframework.store.ZooKeeperMesosWorkerStore;
          +import org.apache.flink.mesos.util.MesosArtifactServer;
          +import org.apache.flink.mesos.util.MesosConfiguration;
          +import org.apache.flink.mesos.util.ZooKeeperUtils;
          +import org.apache.flink.runtime.akka.AkkaUtils;
          +import org.apache.flink.runtime.clusterframework.BootstrapTools;
          +import org.apache.flink.runtime.clusterframework.ContaineredTaskManagerParameters;
          +import org.apache.flink.runtime.jobmanager.JobManager;
          +import org.apache.flink.runtime.jobmanager.MemoryArchivist;
          +import org.apache.flink.runtime.jobmanager.RecoveryMode;
          +import org.apache.flink.runtime.leaderretrieval.LeaderRetrievalService;
          +import org.apache.flink.runtime.process.ProcessReaper;
          +import org.apache.flink.runtime.taskmanager.TaskManager;
          +import org.apache.flink.runtime.util.EnvironmentInformation;
          +import org.apache.flink.runtime.util.LeaderRetrievalUtils;
          +import org.apache.flink.runtime.util.SignalHandler;
          +import org.apache.flink.runtime.webmonitor.WebMonitor;
          +
          +import org.apache.hadoop.security.UserGroupInformation;
          +
          +import org.apache.mesos.Protos;
          +import org.slf4j.Logger;
          +import org.slf4j.LoggerFactory;
          +
          +import scala.Option;
          +import scala.concurrent.duration.Duration;
          +import scala.concurrent.duration.FiniteDuration;
          +
          +import java.io.File;
          +import java.net.InetAddress;
          +import java.net.URL;
          +import java.security.PrivilegedAction;
          +import java.util.Map;
          +import java.util.UUID;
          +import java.util.concurrent.TimeUnit;
          +
          +import static org.apache.flink.mesos.Utils.uri;
          +import static org.apache.flink.mesos.Utils.variable;
          +
          +/**
          + * This class is the executable entry point for the Mesos Application Master.
          + * It starts actor system and the actors for {@link org.apache.flink.runtime.jobmanager.JobManager}
          + * and {@link MesosFlinkResourceManager}.
          + *
          + * The JobManager handles Flink job execution, while the MesosFlinkResourceManager handles container
          + * allocation and failure detection.
          + */
          +public class MesosApplicationMasterRunner {
          + /** Logger */
          + protected static final Logger LOG = LoggerFactory.getLogger(MesosApplicationMasterRunner.class);
          +
          + /** The maximum time that TaskManagers may be waiting to register at the JobManager,
          + * before they quit */
          + private static final FiniteDuration TASKMANAGER_REGISTRATION_TIMEOUT = new FiniteDuration(5, TimeUnit.MINUTES);
          +
          + /** The process environment variables */
          + private static final Map<String, String> ENV = System.getenv();
          +
          + /** The exit code returned if the initialization of the application master failed */
          + private static final int INIT_ERROR_EXIT_CODE = 31;
          +
          + /** The exit code returned if the process exits because a critical actor died */
          + private static final int ACTOR_DIED_EXIT_CODE = 32;
          +
          + // ------------------------------------------------------------------------
          + // Program entry point
          + // ------------------------------------------------------------------------
          +
          + /**
          + * The entry point for the Mesos AppMaster.
          + *
          + * @param args The command line arguments.
          + */
          + public static void main(String[] args) {
          + EnvironmentInformation.logEnvironmentInfo(LOG, "Mesos AppMaster", args);
          + SignalHandler.register(LOG);
          +
          + // run and exit with the proper return code
          + int returnCode = new MesosApplicationMasterRunner().run(args);
          + System.exit(returnCode);
          + }
          +
          + /**
          + * The instance entry point for the Mesos AppMaster. Obtains user group
          + * information and calls the main work method {@link #runPrivileged()} as a
          + * privileged action.
          + *
          + * @param args The command line arguments.
          + * @return The process exit code.
          + */
          + protected int run(String[] args) {
          + try {
          + LOG.debug("All environment variables: {}", ENV);
          +
          + final UserGroupInformation currentUser;
          + try {
          + currentUser = UserGroupInformation.getCurrentUser();
          + } catch (Throwable t) {
          + throw new Exception("Cannot access UserGroupInformation information for current user", t);
          + }
          +
          + LOG.info("Running Flink as user {}", currentUser.getShortUserName());
          +
          + // run the actual work in a secured privileged action
          + return currentUser.doAs(new PrivilegedAction<Integer>() {
          + @Override
          + public Integer run() {
          + return runPrivileged();
          + }
          + });
          + }
          + catch (Throwable t) {
          + // make sure that everything whatever ends up in the log
          + LOG.error("Mesos AppMaster initialization failed", t);
          + return INIT_ERROR_EXIT_CODE;
          + }
          + }
          +
          + // ------------------------------------------------------------------------
          + // Core work method
          + // ------------------------------------------------------------------------
          +
          + /**
          + * The main work method, must run as a privileged action.
          + *
          + * @return The return code for the Java process.
          + */
          + protected int runPrivileged() {
          +
          + ActorSystem actorSystem = null;
          + WebMonitor webMonitor = null;
          + MesosArtifactServer artifactServer = null;
          +
          + try {
          + // ------- (1) load and parse / validate all configurations -------
          +
          + // loading all config values here has the advantage that the program fails fast, if any
          + // configuration problem occurs
          +
          + final String workingDir = ENV.get(MesosConfigKeys.ENV_MESOS_SANDBOX);
          + require(workingDir != null, "Sandbox directory variable (%s) not set", MesosConfigKeys.ENV_MESOS_SANDBOX);
          +
          + final String sessionID = ENV.get(MesosConfigKeys.ENV_SESSION_ID);
          + require(sessionID != null, "Session ID (%s) not set", MesosConfigKeys.ENV_SESSION_ID);
          +
          + // Note that we use the "appMasterHostname" given by the system, to make sure
          + // we use the hostnames consistently throughout akka.
          + // for akka "localhost" and "localhost.localdomain" are different actors.
          + final String appMasterHostname = InetAddress.getLocalHost().getHostName();
          +
          + // Flink configuration
          + final Configuration dynamicProperties =
          + FlinkMesosSessionCli.decodeDynamicProperties(ENV.get(MesosConfigKeys.ENV_DYNAMIC_PROPERTIES));
          + LOG.debug("Mesos dynamic properties: {}", dynamicProperties);
          +
          + final Configuration config = createConfiguration(workingDir, dynamicProperties);
          +
          + // Mesos configuration
          + final MesosConfiguration mesosConfig = createMesosConfig(config, appMasterHostname);
          +
          + // environment values related to TM
          + final int taskManagerContainerMemory;
          + final int numInitialTaskManagers;
          + final int slotsPerTaskManager;
          +
          + try {
          + taskManagerContainerMemory = Integer.parseInt(ENV.get(MesosConfigKeys.ENV_TM_MEMORY));
          + } catch (NumberFormatException e) {
          + throw new RuntimeException("Invalid value for " + MesosConfigKeys.ENV_TM_MEMORY + " : "
          + + e.getMessage());
          + }
          + try {
          + numInitialTaskManagers = Integer.parseInt(ENV.get(MesosConfigKeys.ENV_TM_COUNT));
          + } catch (NumberFormatException e) {
          + throw new RuntimeException("Invalid value for " + MesosConfigKeys.ENV_TM_COUNT + " : "
          + + e.getMessage());
          + }
          + try {
          + slotsPerTaskManager = Integer.parseInt(ENV.get(MesosConfigKeys.ENV_SLOTS));
          + } catch (NumberFormatException e) {
          + throw new RuntimeException("Invalid value for " + MesosConfigKeys.ENV_SLOTS + " : "
          + + e.getMessage());
          + }
          +
          + final ContaineredTaskManagerParameters containeredParameters =
          + ContaineredTaskManagerParameters.create(config, taskManagerContainerMemory, slotsPerTaskManager);
          +
          + final MesosTaskManagerParameters taskManagerParameters =
          + MesosTaskManagerParameters.create(config, containeredParameters);
          +
          + LOG.info("TaskManagers will be created with {} task slots",
          + taskManagerParameters.containeredParameters().numSlots());
          + LOG.info("TaskManagers will be started with container size {} MB, JVM heap size {} MB, " +
          + "JVM direct memory limit {} MB, {} cpus",
          + taskManagerParameters.containeredParameters().taskManagerTotalMemoryMB(),
          + taskManagerParameters.containeredParameters().taskManagerHeapSizeMB(),
          + taskManagerParameters.containeredParameters().taskManagerDirectMemoryLimitMB(),
          + taskManagerParameters.cpus());
          +
          + // JM endpoint, which should be explicitly configured by the dispatcher (based on acquired net resources)
          + final int listeningPort = config.getInteger(ConfigConstants.JOB_MANAGER_IPC_PORT_KEY,
          + ConfigConstants.DEFAULT_JOB_MANAGER_IPC_PORT);
          + require(listeningPort >= 0 && listeningPort <= 65536, "Config parameter \"" +
          + ConfigConstants.JOB_MANAGER_IPC_PORT_KEY + "\" is invalid, it must be between 0 and 65536");
          +
          + // ----------------- (2) start the actor system -------------------
          +
          + // try to start the actor system, JobManager and JobManager actor system
          + // using the configured address and ports
          + actorSystem = BootstrapTools.startActorSystem(config, appMasterHostname, listeningPort, LOG);
          +
          + final String akkaHostname = AkkaUtils.getAddress(actorSystem).host().get();
          + final int akkaPort = (Integer) AkkaUtils.getAddress(actorSystem).port().get();
          +
          + LOG.info("Actor system bound to hostname {}.", akkaHostname);
          +
          + // try to start the artifact server
          + LOG.debug("Starting Artifact Server");
          + final int artifactServerPort = config.getInteger(ConfigConstants.MESOS_ARTIFACT_SERVER_PORT_KEY,
          + ConfigConstants.DEFAULT_MESOS_ARTIFACT_SERVER_PORT);
          + artifactServer = new MesosArtifactServer(sessionID, akkaHostname, artifactServerPort);
          +
          + // ----------------- (3) Generate the configuration for the TaskManagers -------------------
          +
          + final Configuration taskManagerConfig = BootstrapTools.generateTaskManagerConfiguration(
          + config, akkaHostname, akkaPort, slotsPerTaskManager, TASKMANAGER_REGISTRATION_TIMEOUT);
          + LOG.debug("TaskManager configuration: {}", taskManagerConfig);
          +
          + final Protos.TaskInfo.Builder taskManagerContext = createTaskManagerContext(
          + config, mesosConfig, ENV,
          — End diff –

          The `mesosConfig` parameter is unused in the method. Do we want to transfer it to the task managers?

          Show
          githubbot ASF GitHub Bot added a comment - Github user mxm commented on a diff in the pull request: https://github.com/apache/flink/pull/2315#discussion_r75274022 — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/MesosApplicationMasterRunner.java — @@ -0,0 +1,618 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.flink.mesos.runtime.clusterframework; + +import akka.actor.ActorRef; +import akka.actor.ActorSystem; +import akka.actor.Props; + +import org.apache.curator.framework.CuratorFramework; +import org.apache.flink.configuration.ConfigConstants; +import org.apache.flink.configuration.Configuration; +import org.apache.flink.configuration.GlobalConfiguration; +import org.apache.flink.configuration.IllegalConfigurationException; +import org.apache.flink.mesos.cli.FlinkMesosSessionCli; +import org.apache.flink.mesos.runtime.clusterframework.store.MesosWorkerStore; +import org.apache.flink.mesos.runtime.clusterframework.store.StandaloneMesosWorkerStore; +import org.apache.flink.mesos.runtime.clusterframework.store.ZooKeeperMesosWorkerStore; +import org.apache.flink.mesos.util.MesosArtifactServer; +import org.apache.flink.mesos.util.MesosConfiguration; +import org.apache.flink.mesos.util.ZooKeeperUtils; +import org.apache.flink.runtime.akka.AkkaUtils; +import org.apache.flink.runtime.clusterframework.BootstrapTools; +import org.apache.flink.runtime.clusterframework.ContaineredTaskManagerParameters; +import org.apache.flink.runtime.jobmanager.JobManager; +import org.apache.flink.runtime.jobmanager.MemoryArchivist; +import org.apache.flink.runtime.jobmanager.RecoveryMode; +import org.apache.flink.runtime.leaderretrieval.LeaderRetrievalService; +import org.apache.flink.runtime.process.ProcessReaper; +import org.apache.flink.runtime.taskmanager.TaskManager; +import org.apache.flink.runtime.util.EnvironmentInformation; +import org.apache.flink.runtime.util.LeaderRetrievalUtils; +import org.apache.flink.runtime.util.SignalHandler; +import org.apache.flink.runtime.webmonitor.WebMonitor; + +import org.apache.hadoop.security.UserGroupInformation; + +import org.apache.mesos.Protos; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import scala.Option; +import scala.concurrent.duration.Duration; +import scala.concurrent.duration.FiniteDuration; + 
+import java.io.File; +import java.net.InetAddress; +import java.net.URL; +import java.security.PrivilegedAction; +import java.util.Map; +import java.util.UUID; +import java.util.concurrent.TimeUnit; + +import static org.apache.flink.mesos.Utils.uri; +import static org.apache.flink.mesos.Utils.variable; + +/** + * This class is the executable entry point for the Mesos Application Master. + * It starts actor system and the actors for {@link org.apache.flink.runtime.jobmanager.JobManager} + * and {@link MesosFlinkResourceManager} . + * + * The JobManager handles Flink job execution, while the MesosFlinkResourceManager handles container + * allocation and failure detection. + */ +public class MesosApplicationMasterRunner { + /** Logger */ + protected static final Logger LOG = LoggerFactory.getLogger(MesosApplicationMasterRunner.class); + + /** The maximum time that TaskManagers may be waiting to register at the JobManager, + * before they quit */ + private static final FiniteDuration TASKMANAGER_REGISTRATION_TIMEOUT = new FiniteDuration(5, TimeUnit.MINUTES); + + /** The process environment variables */ + private static final Map<String, String> ENV = System.getenv(); + + /** The exit code returned if the initialization of the application master failed */ + private static final int INIT_ERROR_EXIT_CODE = 31; + + /** The exit code returned if the process exits because a critical actor died */ + private static final int ACTOR_DIED_EXIT_CODE = 32; + + // ------------------------------------------------------------------------ + // Program entry point + // ------------------------------------------------------------------------ + + /** + * The entry point for the Mesos AppMaster. + * + * @param args The command line arguments. 
+ */ + public static void main(String[] args) { + EnvironmentInformation.logEnvironmentInfo(LOG, "Mesos AppMaster", args); + SignalHandler.register(LOG); + + // run and exit with the proper return code + int returnCode = new MesosApplicationMasterRunner().run(args); + System.exit(returnCode); + } + + /** + * The instance entry point for the Mesos AppMaster. Obtains user group + * information and calls the main work method {@link #runPrivileged()} as a + * privileged action. + * + * @param args The command line arguments. + * @return The process exit code. + */ + protected int run(String[] args) { + try { + LOG.debug("All environment variables: {}", ENV); + + final UserGroupInformation currentUser; + try { + currentUser = UserGroupInformation.getCurrentUser(); + } catch (Throwable t) { + throw new Exception("Cannot access UserGroupInformation information for current user", t); + } + + LOG.info("Running Flink as user {}", currentUser.getShortUserName()); + + // run the actual work in a secured privileged action + return currentUser.doAs(new PrivilegedAction<Integer>() { + @Override + public Integer run() { + return runPrivileged(); + } + }); + } + catch (Throwable t) { + // make sure that everything whatever ends up in the log + LOG.error("Mesos AppMaster initialization failed", t); + return INIT_ERROR_EXIT_CODE; + } + } + + // ------------------------------------------------------------------------ + // Core work method + // ------------------------------------------------------------------------ + + /** + * The main work method, must run as a privileged action. + * + * @return The return code for the Java process. 
+ */ + protected int runPrivileged() { + + ActorSystem actorSystem = null; + WebMonitor webMonitor = null; + MesosArtifactServer artifactServer = null; + + try { + // ------- (1) load and parse / validate all configurations ------- + + // loading all config values here has the advantage that the program fails fast, if any + // configuration problem occurs + + final String workingDir = ENV.get(MesosConfigKeys.ENV_MESOS_SANDBOX); + require(workingDir != null, "Sandbox directory variable (%s) not set", MesosConfigKeys.ENV_MESOS_SANDBOX); + + final String sessionID = ENV.get(MesosConfigKeys.ENV_SESSION_ID); + require(sessionID != null, "Session ID (%s) not set", MesosConfigKeys.ENV_SESSION_ID); + + // Note that we use the "appMasterHostname" given by the system, to make sure + // we use the hostnames consistently throughout akka. + // for akka "localhost" and "localhost.localdomain" are different actors. + final String appMasterHostname = InetAddress.getLocalHost().getHostName(); + + // Flink configuration + final Configuration dynamicProperties = + FlinkMesosSessionCli.decodeDynamicProperties(ENV.get(MesosConfigKeys.ENV_DYNAMIC_PROPERTIES)); + LOG.debug("Mesos dynamic properties: {}", dynamicProperties); + + final Configuration config = createConfiguration(workingDir, dynamicProperties); + + // Mesos configuration + final MesosConfiguration mesosConfig = createMesosConfig(config, appMasterHostname); + + // environment values related to TM + final int taskManagerContainerMemory; + final int numInitialTaskManagers; + final int slotsPerTaskManager; + + try { + taskManagerContainerMemory = Integer.parseInt(ENV.get(MesosConfigKeys.ENV_TM_MEMORY)); + } catch (NumberFormatException e) { + throw new RuntimeException("Invalid value for " + MesosConfigKeys.ENV_TM_MEMORY + " : " + + e.getMessage()); + } + try { + numInitialTaskManagers = Integer.parseInt(ENV.get(MesosConfigKeys.ENV_TM_COUNT)); + } catch (NumberFormatException e) { + throw new RuntimeException("Invalid value for 
" + MesosConfigKeys.ENV_TM_COUNT + " : " + + e.getMessage()); + } + try { + slotsPerTaskManager = Integer.parseInt(ENV.get(MesosConfigKeys.ENV_SLOTS)); + } catch (NumberFormatException e) { + throw new RuntimeException("Invalid value for " + MesosConfigKeys.ENV_SLOTS + " : " + + e.getMessage()); + } + + final ContaineredTaskManagerParameters containeredParameters = + ContaineredTaskManagerParameters.create(config, taskManagerContainerMemory, slotsPerTaskManager); + + final MesosTaskManagerParameters taskManagerParameters = + MesosTaskManagerParameters.create(config, containeredParameters); + + LOG.info("TaskManagers will be created with {} task slots", + taskManagerParameters.containeredParameters().numSlots()); + LOG.info("TaskManagers will be started with container size {} MB, JVM heap size {} MB, " + + "JVM direct memory limit {} MB, {} cpus", + taskManagerParameters.containeredParameters().taskManagerTotalMemoryMB(), + taskManagerParameters.containeredParameters().taskManagerHeapSizeMB(), + taskManagerParameters.containeredParameters().taskManagerDirectMemoryLimitMB(), + taskManagerParameters.cpus()); + + // JM endpoint, which should be explicitly configured by the dispatcher (based on acquired net resources) + final int listeningPort = config.getInteger(ConfigConstants.JOB_MANAGER_IPC_PORT_KEY, + ConfigConstants.DEFAULT_JOB_MANAGER_IPC_PORT); + require(listeningPort >= 0 && listeningPort <= 65536, "Config parameter \"" + + ConfigConstants.JOB_MANAGER_IPC_PORT_KEY + "\" is invalid, it must be between 0 and 65536"); + + // ----------------- (2) start the actor system ------------------- + + // try to start the actor system, JobManager and JobManager actor system + // using the configured address and ports + actorSystem = BootstrapTools.startActorSystem(config, appMasterHostname, listeningPort, LOG); + + final String akkaHostname = AkkaUtils.getAddress(actorSystem).host().get(); + final int akkaPort = (Integer) AkkaUtils.getAddress(actorSystem).port().get(); + + 
LOG.info("Actor system bound to hostname {}.", akkaHostname); + + // try to start the artifact server + LOG.debug("Starting Artifact Server"); + final int artifactServerPort = config.getInteger(ConfigConstants.MESOS_ARTIFACT_SERVER_PORT_KEY, + ConfigConstants.DEFAULT_MESOS_ARTIFACT_SERVER_PORT); + artifactServer = new MesosArtifactServer(sessionID, akkaHostname, artifactServerPort); + + // ----------------- (3) Generate the configuration for the TaskManagers ------------------- + + final Configuration taskManagerConfig = BootstrapTools.generateTaskManagerConfiguration( + config, akkaHostname, akkaPort, slotsPerTaskManager, TASKMANAGER_REGISTRATION_TIMEOUT); + LOG.debug("TaskManager configuration: {}", taskManagerConfig); + + final Protos.TaskInfo.Builder taskManagerContext = createTaskManagerContext( + config, mesosConfig, ENV, — End diff – The `mesosConfig` parameter is unused in the method. Do we want to transfer it to the task managers?
          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75272248

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/LaunchableMesosWorker.java —
          @@ -0,0 +1,205 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.runtime.clusterframework;
          +
          +import com.netflix.fenzo.ConstraintEvaluator;
          +import com.netflix.fenzo.TaskAssignmentResult;
          +import com.netflix.fenzo.TaskRequest;
          +import com.netflix.fenzo.VMTaskFitnessCalculator;
          +import org.apache.flink.configuration.Configuration;
          +import org.apache.flink.mesos.cli.FlinkMesosSessionCli;
          +import org.apache.flink.mesos.scheduler.LaunchableTask;
          +import org.apache.mesos.Protos;
          +
          +import java.util.Collections;
          +import java.util.List;
          +import java.util.Map;
          +import java.util.concurrent.atomic.AtomicReference;
          +
          +import static org.apache.flink.mesos.Utils.variable;
          +import static org.apache.flink.mesos.Utils.range;
          +import static org.apache.flink.mesos.Utils.ranges;
          +import static org.apache.flink.mesos.Utils.scalar;
          +
          +/**
          + * Specifies how to launch a Mesos worker.
          + */
          +public class LaunchableMesosWorker implements LaunchableTask {
          +
          + /**
          + * The set of configuration keys to be dynamically configured with a port allocated from Mesos.
          + */
          + private static String[] TM_PORT_KEYS = {
          + "taskmanager.rpc.port",
          + "taskmanager.data.port"
          + };
          +
          + private final MesosTaskManagerParameters params;
          + private final Protos.TaskInfo.Builder template;
          + private final Protos.TaskID taskID;
          + private final Request taskRequest;
          +
          + /**
          + * Construct a launchable Mesos worker.
          + * @param params the TM parameters such as memory, cpu to acquire.
          + * @param template a template for the TaskInfo to be constructed at launch time.
          + * @param taskID the taskID for this worker.
          + */
          + public LaunchableMesosWorker(MesosTaskManagerParameters params, Protos.TaskInfo.Builder template, Protos.TaskID taskID) {
          + this.params = params;
          + this.template = template;
          + this.taskID = taskID;
          + this.taskRequest = new Request();
          + }
          +
          + public Protos.TaskID taskID() {
          + return taskID;
          + }
          +
          + @Override
          + public TaskRequest taskRequest() {
          + return taskRequest;
          + }
          +
          + class Request implements TaskRequest {
          + private final AtomicReference<TaskRequest.AssignedResources> assignedResources = new AtomicReference<>();
          +
          + @Override
          + public String getId() {
          + return taskID.getValue();
          + }
          +
          + @Override
          + public String taskGroupName() {
          + return "";
          + }
          +
          + @Override
          + public double getCPUs() {
          + return params.cpus();
          + }
          +
          + @Override
          + public double getMemory() {
          + return params.containeredParameters().taskManagerTotalMemoryMB();
          + }
          +
          + @Override
          + public double getNetworkMbps() {
          + return 0.0;
          + }
          +
          + @Override
          + public double getDisk() {
          + return 0.0;
          — End diff –

          I would rather throw an exception here if the value is not in use.
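
          A minimal sketch of what this suggestion could look like. `ResourceRequest` and `StrictWorkerRequest` are hypothetical stand-ins invented for illustration; they are not Fenzo's `TaskRequest` or the class under review. Whether the scheduler itself calls `getDisk()` while evaluating offers is not established in this thread; if it does, throwing would break assignment and returning 0.0 would have to stay.

```java
// Hypothetical stand-in for a resource request; NOT Fenzo's TaskRequest interface.
interface ResourceRequest {
    double getCPUs();
    double getDisk();
}

// Sketch: fail loudly on a resource dimension that the worker never requests,
// instead of silently reporting 0.0.
class StrictWorkerRequest implements ResourceRequest {
    private final double cpus;

    StrictWorkerRequest(double cpus) {
        this.cpus = cpus;
    }

    @Override
    public double getCPUs() {
        return cpus;
    }

    @Override
    public double getDisk() {
        // Disk is not part of this request, so reject the call outright.
        throw new UnsupportedOperationException("disk is not requested by this worker");
    }
}
```

          The trade-off is that any caller treating the getter as a plain query now has to handle the exception, so this only fits if the value is truly never read.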

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75272075

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/LaunchableMesosWorker.java —
          @@ -0,0 +1,205 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.runtime.clusterframework;
          +
          +import com.netflix.fenzo.ConstraintEvaluator;
          +import com.netflix.fenzo.TaskAssignmentResult;
          +import com.netflix.fenzo.TaskRequest;
          +import com.netflix.fenzo.VMTaskFitnessCalculator;
          +import org.apache.flink.configuration.Configuration;
          +import org.apache.flink.mesos.cli.FlinkMesosSessionCli;
          +import org.apache.flink.mesos.scheduler.LaunchableTask;
          +import org.apache.mesos.Protos;
          +
          +import java.util.Collections;
          +import java.util.List;
          +import java.util.Map;
          +import java.util.concurrent.atomic.AtomicReference;
          +
          +import static org.apache.flink.mesos.Utils.variable;
          +import static org.apache.flink.mesos.Utils.range;
          +import static org.apache.flink.mesos.Utils.ranges;
          +import static org.apache.flink.mesos.Utils.scalar;
          +
          +/**
          + * Specifies how to launch a Mesos worker.
          + */
          +public class LaunchableMesosWorker implements LaunchableTask {
          +
          + /**
          + * The set of configuration keys to be dynamically configured with a port allocated from Mesos.
          + */
          + private static String[] TM_PORT_KEYS = {
          + "taskmanager.rpc.port",
          + "taskmanager.data.port"
          + };
          +
          + private final MesosTaskManagerParameters params;
          + private final Protos.TaskInfo.Builder template;
          + private final Protos.TaskID taskID;
          + private final Request taskRequest;
          +
          + /**
          + * Construct a launchable Mesos worker.
          + * @param params the TM parameters such as memory, cpu to acquire.
          + * @param template a template for the TaskInfo to be constructed at launch time.
          + * @param taskID the taskID for this worker.
          + */
          + public LaunchableMesosWorker(MesosTaskManagerParameters params, Protos.TaskInfo.Builder template, Protos.TaskID taskID) {
          + this.params = params;
          + this.template = template;
          + this.taskID = taskID;
          + this.taskRequest = new Request();
          + }
          +
          + public Protos.TaskID taskID() {
          + return taskID;
          + }
          +
          + @Override
          + public TaskRequest taskRequest() {
          + return taskRequest;
          + }
          +
          + class Request implements TaskRequest {
          + private final AtomicReference<TaskRequest.AssignedResources> assignedResources = new AtomicReference<>();
          +
          + @Override
          + public String getId() {
          + return taskID.getValue();
          + }
          +
          + @Override
          + public String taskGroupName() {
          + return "";
          + }
          +
          + @Override
          + public double getCPUs() {
          + return params.cpus();
          + }
          +
          + @Override
          + public double getMemory() {
          + return params.containeredParameters().taskManagerTotalMemoryMB();
          + }
          +
          + @Override
          + public double getNetworkMbps() {
          + return 0.0;
          + }
          +
          + @Override
          + public double getDisk() {
          + return 0.0;
          + }
          +
          + @Override
          + public int getPorts() {
          + return TM_PORT_KEYS.length;
          + }
          +
          + @Override
          + public Map<String, NamedResourceSetRequest> getCustomNamedResources() {
          + return Collections.emptyMap();
          + }
          +
          + @Override
          + public List<? extends ConstraintEvaluator> getHardConstraints() {
          + return null;
          + }
          +
          + @Override
          + public List<? extends VMTaskFitnessCalculator> getSoftConstraints() {
          + return null;
          + }
          +
          + @Override
          + public void setAssignedResources(AssignedResources assignedResources) {
          + this.assignedResources.set(assignedResources);
          + }
          +
          + @Override
          + public AssignedResources getAssignedResources() {
          + return assignedResources.get();
          + }
          +
          + @Override
          + public String toString() {
          + return "Request{" +
          + "cpus=" + getCPUs() +
          + "memory=" + getMemory() +
          + '}';
          + }
          + }
          +
          + /**
          + * Construct the TaskInfo needed to launch the worker.
          + * @param slaveId the assigned slave.
          + * @param assignment the assignment details.
          + * @return a fully-baked TaskInfo.
          + */
          + @Override
          + public Protos.TaskInfo launch(Protos.SlaveID slaveId, TaskAssignmentResult assignment) {
          +
          + final Configuration dynamicProperties = new Configuration();
          +
          + // specialize the TaskInfo template with assigned resources, environment variables, etc
          + final Protos.TaskInfo.Builder taskInfo = template
          + .clone()
          + .setSlaveId(slaveId)
          + .setTaskId(taskID)
          + .setName(taskID.getValue())
          + .addResources(scalar("cpus", assignment.getRequest().getCPUs()))
          + .addResources(scalar("mem", assignment.getRequest().getMemory()));
          + //.addResources(scalar("disk", assignment.getRequest.getDisk).setRole("Flink"))
          — End diff –

          Ah, you're not requesting the disk size 0

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75271636

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/Utils.java —
          @@ -0,0 +1,67 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos;
          +
          +import org.apache.mesos.Protos;
          +
          +import java.net.URL;
          +import java.util.Arrays;
          +
          +public class Utils {
          + /**
          + * Construct a Mesos environment variable.
          + */
          — End diff –

          The indentation is off here.

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75271531

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/LaunchableMesosWorker.java —
          @@ -0,0 +1,205 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.runtime.clusterframework;
          +
          +import com.netflix.fenzo.ConstraintEvaluator;
          +import com.netflix.fenzo.TaskAssignmentResult;
          +import com.netflix.fenzo.TaskRequest;
          +import com.netflix.fenzo.VMTaskFitnessCalculator;
          +import org.apache.flink.configuration.Configuration;
          +import org.apache.flink.mesos.cli.FlinkMesosSessionCli;
          +import org.apache.flink.mesos.scheduler.LaunchableTask;
          +import org.apache.mesos.Protos;
          +
          +import java.util.Collections;
          +import java.util.List;
          +import java.util.Map;
          +import java.util.concurrent.atomic.AtomicReference;
          +
          +import static org.apache.flink.mesos.Utils.variable;
          +import static org.apache.flink.mesos.Utils.range;
          +import static org.apache.flink.mesos.Utils.ranges;
          +import static org.apache.flink.mesos.Utils.scalar;
          +
          +/**
          + * Specifies how to launch a Mesos worker.
          + */
          +public class LaunchableMesosWorker implements LaunchableTask {
          +
          + /**
          + * The set of configuration keys to be dynamically configured with a port allocated from Mesos.
          + */
          + private static String[] TM_PORT_KEYS = {
          + "taskmanager.rpc.port",
          + "taskmanager.data.port"
          + };
          +
          + private final MesosTaskManagerParameters params;
          + private final Protos.TaskInfo.Builder template;
          + private final Protos.TaskID taskID;
          + private final Request taskRequest;
          +
          + /**
          + * Construct a launchable Mesos worker.
          + * @param params the TM parameters such as memory, cpu to acquire.
          + * @param template a template for the TaskInfo to be constructed at launch time.
          + * @param taskID the taskID for this worker.
          + */
          + public LaunchableMesosWorker(MesosTaskManagerParameters params, Protos.TaskInfo.Builder template, Protos.TaskID taskID) {
          + this.params = params;
          + this.template = template;
          + this.taskID = taskID;
          + this.taskRequest = new Request();
          + }
          +
          + public Protos.TaskID taskID() {
          + return taskID;
          + }
          +
          + @Override
          + public TaskRequest taskRequest() {
          + return taskRequest;
          + }
          +
          + class Request implements TaskRequest {
          + private final AtomicReference<TaskRequest.AssignedResources> assignedResources = new AtomicReference<>();
          +
          + @Override
          + public String getId() {
          + return taskID.getValue();
          + }
          +
          + @Override
          + public String taskGroupName() {
          + return "";
          + }
          +
          + @Override
          + public double getCPUs() {
          + return params.cpus();
          + }
          +
          + @Override
          + public double getMemory() {
          + return params.containeredParameters().taskManagerTotalMemoryMB();
          + }
          +
          + @Override
          + public double getNetworkMbps() {
          + return 0.0;
          + }
          +
          + @Override
          + public double getDisk() {
          + return 0.0;
          — End diff –

          This is always 0.0. Does that mean "give me whatever is free"?

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75270753

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/cli/FlinkMesosSessionCli.java —
          @@ -0,0 +1,59 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.cli;
          +
          +import com.fasterxml.jackson.core.JsonProcessingException;
          +import com.fasterxml.jackson.core.type.TypeReference;
          +import com.fasterxml.jackson.databind.ObjectMapper;
          +import org.apache.flink.configuration.Configuration;
          +
          +import java.io.IOException;
          +import java.util.Map;
          +
          +public class FlinkMesosSessionCli {
          — End diff –

          This looks just like a dummy/stub class? Not a CLI yet

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75269975

          — Diff: flink-mesos/pom.xml —
          @@ -0,0 +1,294 @@
          +<!--
          +Licensed to the Apache Software Foundation (ASF) under one
          +or more contributor license agreements. See the NOTICE file
          +distributed with this work for additional information
          +regarding copyright ownership. The ASF licenses this file
          +to you under the Apache License, Version 2.0 (the
          +"License"); you may not use this file except in compliance
          +with the License. You may obtain a copy of the License at
          +
          + http://www.apache.org/licenses/LICENSE-2.0
          +
          +Unless required by applicable law or agreed to in writing,
          +software distributed under the License is distributed on an
          +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
          +KIND, either express or implied. See the License for the
          +specific language governing permissions and limitations
          +under the License.
          +-->
          +<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          + xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
          + <modelVersion>4.0.0</modelVersion>
          +
          + <parent>
          + <groupId>org.apache.flink</groupId>
          + <artifactId>flink-parent</artifactId>
          + <version>1.1-SNAPSHOT</version>
          + <relativePath>..</relativePath>
          + </parent>
          +
          + <artifactId>flink-mesos_2.10</artifactId>
          + <name>flink-mesos</name>
          + <packaging>jar</packaging>
          +
          + <properties>
          + <mesos.version>0.27.1</mesos.version>
          + </properties>
          +
          + <dependencies>
          + <dependency>
          + <groupId>org.apache.flink</groupId>
          + <artifactId>flink-runtime_2.10</artifactId>
          + <version>${project.version}</version>
          + <exclusions>
          + <exclusion>
          + <artifactId>hadoop-core</artifactId>
          — End diff –

          Why do you exclude just `hadoop-core` here?

          githubbot ASF GitHub Bot added a comment -

          Github user mxm commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75269835

          — Diff: flink-dist/pom.xml —
          @@ -113,8 +113,13 @@ under the License.
          <artifactId>flink-metrics-jmx</artifactId>
          <version>${project.version}</version>
          </dependency>
          +
          + <dependency>
          + <groupId>org.apache.flink</groupId>
          + <artifactId>flink-mesos_2.10</artifactId>
          + <version>${project.version}</version>
          + </dependency>
          — End diff –

          We always build YARN. We use the `include-yarn-tests` profile to include/exclude the YARN tests. The `include-yarn` profile, on the other hand, is used to exclude YARN for the Hadoop 1 version of Flink.

          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/2315

          Great work @EronWright. The code quality, testing, and extent of this PR are really impressive. I think you've nicely modularized the Mesos resource manager with the additional actors. This makes testing much nicer.

          I only had some minor comments and questions for my own understanding. Tomorrow I want to try it out on a Mesos cluster to see how it works.

          There is only one thing I'm wondering about. Since we're also currently working on FLIP-6, where we try to put a new RPC abstraction in place in order to eventually remove Akka and Scala from flink-runtime, I wanted to ask whether you think the FSM actors could also be replaced (sometime in the future) by something else. I totally agree that they are a nice abstraction for Mesos and allow the logic to be expressed succinctly.
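
To make the question about replacing the FSM actors more concrete, here is a minimal sketch of how the coordinator's state transitions might look as a plain Java class without Akka. It is not part of the PR. The class name `LaunchStateMachineSketch`, its methods, and the use of string task IDs are hypothetical, and only the transitions visible in the quoted `LaunchCoordinator` diff (Suspended, Idle, GatheringOffers) are modeled; offer handling and the exit from GatheringOffers are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical, Akka-free sketch of the launch coordinator's state transitions. */
final class LaunchStateMachineSketch {

    enum State { SUSPENDED, IDLE, GATHERING_OFFERS }

    // The coordinator starts out suspended, waiting for a connection to Mesos.
    private State state = State.SUSPENDED;

    // Tasks that still need to be matched to offers (string IDs are an assumption).
    private final List<String> pendingTasks;

    LaunchStateMachineSketch(List<String> initialTasks) {
        this.pendingTasks = new ArrayList<>(initialTasks);
    }

    State state() {
        return state;
    }

    /** Connected: leave SUSPENDED; gather offers only if tasks are outstanding. */
    void onConnected() {
        if (state == State.SUSPENDED) {
            state = pendingTasks.isEmpty() ? State.IDLE : State.GATHERING_OFFERS;
        }
    }

    /** Disconnected: an idle coordinator goes back to SUSPENDED. */
    void onDisconnected() {
        if (state == State.IDLE) {
            state = State.SUSPENDED;
        }
    }

    /** Launch request: an idle coordinator records the tasks and starts gathering offers. */
    void onLaunch(List<String> taskIds) {
        if (state == State.IDLE) {
            pendingTasks.addAll(taskIds);
            state = State.GATHERING_OFFERS;
        }
    }

    public static void main(String[] args) {
        LaunchStateMachineSketch sm =
            new LaunchStateMachineSketch(Collections.<String>emptyList());
        sm.onConnected();
        System.out.println(sm.state());
        sm.onLaunch(Arrays.asList("task-1"));
        System.out.println(sm.state());
    }
}
```

A class like this gives up the actor mailbox, so the caller would have to make sure events are delivered from a single thread, which the FSM actor currently guarantees.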

          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75156694

          — Diff: flink-mesos/src/test/resources/log4j-test.properties —
          @@ -0,0 +1,32 @@
          +################################################################################
          +# Licensed to the Apache Software Foundation (ASF) under one
          +# or more contributor license agreements. See the NOTICE file
          +# distributed with this work for additional information
          +# regarding copyright ownership. The ASF licenses this file
          +# to you under the Apache License, Version 2.0 (the
          +# "License"); you may not use this file except in compliance
          +# with the License. You may obtain a copy of the License at
          +#
          +# http://www.apache.org/licenses/LICENSE-2.0
          +#
          +# Unless required by applicable law or agreed to in writing, software
          +# distributed under the License is distributed on an "AS IS" BASIS,
          +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          +# See the License for the specific language governing permissions and
          +# limitations under the License.
          +################################################################################
          +
          +log4j.rootLogger=INFO, console
          — End diff –

          log level should be `OFF`.

          githubbot ASF GitHub Bot added a comment -

          Github user EronWright commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75154390

          — Diff: flink-mesos/src/main/scala/org/apache/flink/runtime/clusterframework/ContaineredJobManager.scala —
          @@ -0,0 +1,174 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.runtime.clusterframework
          +
          +import java.util.concurrent.{TimeUnit, ExecutorService}
          +
          +import akka.actor.ActorRef
          +
          +import org.apache.flink.api.common.JobID
          +import org.apache.flink.configuration.{Configuration => FlinkConfiguration, ConfigConstants}
          +import org.apache.flink.runtime.checkpoint.savepoint.SavepointStore
          +import org.apache.flink.runtime.checkpoint.CheckpointRecoveryFactory
          +import org.apache.flink.runtime.clusterframework.ApplicationStatus
          +import org.apache.flink.runtime.executiongraph.restart.RestartStrategyFactory
          +import org.apache.flink.runtime.clusterframework.messages._
          +import org.apache.flink.runtime.jobgraph.JobStatus
          +import org.apache.flink.runtime.jobmanager.{SubmittedJobGraphStore, JobManager}
          +import org.apache.flink.runtime.leaderelection.LeaderElectionService
          +import org.apache.flink.runtime.messages.JobManagerMessages.{RequestJobStatus, CurrentJobStatus, JobNotFound}
          +import org.apache.flink.runtime.messages.Messages.Acknowledge
          +import org.apache.flink.runtime.metrics.{MetricRegistry => FlinkMetricRegistry}
          +import org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager
          +import org.apache.flink.runtime.instance.InstanceManager
          +import org.apache.flink.runtime.jobmanager.scheduler.{Scheduler => FlinkScheduler}
          +
          +import scala.concurrent.duration._
          +import scala.language.postfixOps
          +
          +
          +/** JobManager actor for execution on Yarn or Mesos. It enriches the [[JobManager]] with additional messages
          + * to start/administer/stop the session.
          + *
          + * @param flinkConfiguration Configuration object for the actor
          + * @param executorService Execution context which is used to execute concurrent tasks in the
          + * [[org.apache.flink.runtime.executiongraph.ExecutionGraph]]
          + * @param instanceManager Instance manager to manage the registered
          + * [[org.apache.flink.runtime.taskmanager.TaskManager]]
          + * @param scheduler Scheduler to schedule Flink jobs
          + * @param libraryCacheManager Manager to manage uploaded jar files
          + * @param archive Archive for finished Flink jobs
          + * @param restartStrategyFactory Restart strategy to be used in case of a job recovery
          + * @param timeout Timeout for futures
          + * @param leaderElectionService LeaderElectionService to participate in the leader election
          + */
          +abstract class ContaineredJobManager(
          + flinkConfiguration: FlinkConfiguration,
          + executorService: ExecutorService,
          + instanceManager: InstanceManager,
          + scheduler: FlinkScheduler,
          + libraryCacheManager: BlobLibraryCacheManager,
          + archive: ActorRef,
          + restartStrategyFactory: RestartStrategyFactory,
          + timeout: FiniteDuration,
          + leaderElectionService: LeaderElectionService,
          + submittedJobGraphs : SubmittedJobGraphStore,
          + checkpointRecoveryFactory : CheckpointRecoveryFactory,
          + savepointStore: SavepointStore,
          + jobRecoveryTimeout: FiniteDuration,
          + metricsRegistry: Option[FlinkMetricRegistry])
          + extends JobManager(
          + flinkConfiguration,
          + executorService,
          + instanceManager,
          + scheduler,
          + libraryCacheManager,
          + archive,
          + restartStrategyFactory,
          + timeout,
          + leaderElectionService,
          + submittedJobGraphs,
          + checkpointRecoveryFactory,
          + savepointStore,
          + jobRecoveryTimeout,
          + metricsRegistry) {
          +
          + val jobPollingInterval: FiniteDuration
          +
          + // indicates if this JM has been started in a dedicated (per-job) mode.
          + var stopWhenJobFinished: JobID = null
          +
          + override def handleMessage: Receive = {
          + handleContainerMessage orElse super.handleMessage
          + }
          +
          + def handleContainerMessage: Receive = {
          +
          + case msg @ (_: RegisterInfoMessageListener | _: UnRegisterInfoMessageListener) =>
          + // forward to ResourceManager
          + currentResourceManager match {
          + case Some(rm) =>
          + // we forward the message
          + rm.forward(decorateMessage(msg))
          + case None =>
          + // client has to try again
          + }
          +
          + case msg: ShutdownClusterAfterJob =>
          + val jobId = msg.jobId()
          + log.info(s"ApplicationMaster will shut down session when job $jobId has finished.")
          + stopWhenJobFinished = jobId
          + // trigger regular job status messages (if this is a dedicated/per-job cluster)
          + if (stopWhenJobFinished != null) {
          + context.system.scheduler.schedule(0 seconds,
          — End diff –

          This code is copied from the `YarnJobManager`. Indeed, I intend to have `YarnJobManager` extend the above `ContaineredJobManager` to consolidate the logic. I don't want to challenge the logic itself; @mxm might best be able to address your remark.

          githubbot ASF GitHub Bot added a comment -

          Github user EronWright commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75153237

          — Diff: flink-mesos/src/main/scala/org/apache/flink/mesos/scheduler/LaunchCoordinator.scala —
          @@ -0,0 +1,349 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.scheduler
          +
          +import akka.actor.{Actor, ActorRef, FSM, Props}
          +import com.netflix.fenzo._
          +import com.netflix.fenzo.functions.Action1
          +import com.netflix.fenzo.plugins.VMLeaseObject
          +import grizzled.slf4j.Logger
          +import org.apache.flink.api.java.tuple.{Tuple2=>FlinkTuple2}
          +import org.apache.flink.configuration.Configuration
          +import org.apache.flink.mesos.scheduler.LaunchCoordinator._
          +import org.apache.flink.mesos.scheduler.messages._
          +import org.apache.mesos.Protos.TaskInfo
          +import org.apache.mesos.{SchedulerDriver, Protos}
          +
          +import scala.collection.JavaConverters._
          +import scala.collection.mutable.{Map => MutableMap}
          +import scala.concurrent.duration._
          +
          +/**
          + * The launch coordinator handles offer processing, including
          + * matching offers to tasks and making reservations.
          + *
          + * The coordinator uses Netflix Fenzo to optimize task placement. During the GatheringOffers phase,
          + * offers are evaluated by Fenzo for suitability to the planned tasks. Reservations are then placed
          + * against the best offers, leading to revised offers containing reserved resources with which to launch task(s).
          + */
          +class LaunchCoordinator(
          + manager: ActorRef,
          + config: Configuration,
          + schedulerDriver: SchedulerDriver,
          + optimizerBuilder: TaskSchedulerBuilder
          + ) extends Actor with FSM[TaskState, GatherData] {
          +
          + val LOG = Logger(getClass)
          +
          + /**
          + * The task placement optimizer.
          + *
          + * The optimizer contains the following state:
          + * - unused offers
          + * - existing task placement (for fitness calculation involving task colocation)
          + */
          + private[mesos] val optimizer: TaskScheduler = {
          + optimizerBuilder
          + .withLeaseRejectAction(new Action1[VirtualMachineLease]() {
          + def call(lease: VirtualMachineLease) {
          + LOG.info(s"Declined offer ${lease.getId} from ${lease.hostname()} of ${lease.memoryMB()} MB, ${lease.cpuCores()} cpus.")
          + schedulerDriver.declineOffer(lease.getOffer.getId)
          + }
          + }).build
          + }
          +
          + override def postStop(): Unit = {
          + optimizer.shutdown()
          + super.postStop()
          + }
          +
          + /**
          + * Initial state
          + */
          + startWith(Suspended, GatherData(tasks = Nil, newLeases = Nil))
          +
          + /**
          + * State: Suspended
          + *
          + * Wait for (re-)connection to Mesos. No offers exist in this state, but outstanding tasks might.
          + */
          + when(Suspended) {
          + case Event(msg: Connected, data: GatherData) =>
          + if(data.tasks.nonEmpty) goto(GatheringOffers)
          + else goto(Idle)
          + }
          +
          + /**
          + * State: Idle
          + *
          + * Wait for a task request to arrive, then transition into gathering offers.
          + */
          + onTransition {
          + case _ -> Idle => assert(nextStateData.tasks.isEmpty)
          + }
          +
          + when(Idle) {
          + case Event(msg: Disconnected, data: GatherData) =>
          + goto(Suspended)
          +
          + case Event(offers: ResourceOffers, data: GatherData) =>
          + // decline any offers that come in
          + schedulerDriver.suppressOffers()
          + for(offer <- offers.offers().asScala) { schedulerDriver.declineOffer(offer.getId) }
          + stay()
          +
          + case Event(msg: Launch, data: GatherData) =>
          + goto(GatheringOffers) using data.copy(tasks = data.tasks ++ msg.tasks.asScala)
          + }
          +
          + /**
          + * Transition logic to control the flow of offers.
          + */
          + onTransition {
          + case _ -> GatheringOffers =>
          + LOG.info(s"Now gathering offers for at least ${nextStateData.tasks.length} task(s).")
          + schedulerDriver.reviveOffers()
          +
          + case GatheringOffers -> _ =>
          + // decline any outstanding offers and suppress future offers
          + LOG.info(s"No longer gathering offers; all requests fulfilled.")
          +
          + assert(nextStateData.newLeases.isEmpty)
          + schedulerDriver.suppressOffers()
          + optimizer.expireAllLeases()
          + }
          +
          + /**
          + * State: GatheringOffers
          + *
          + * Wait for offers to accumulate for a fixed length of time or from specific slaves.
          — End diff –

          Usually one waits for offers to maximize the overall fitness; there's a fundamental latency/fitness tradeoff. I agree that offer quality (aside from the hard constraints of cpu/ram) is not an important consideration at this time. This overall design allows for tuning along many dimensions, and I didn't want to engage in premature optimization in this first-cut. Nonetheless I will change the timeout to be much more aggressive.
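
          The state flow in the quoted diff (Suspended, Idle, GatheringOffers) and the latency/fitness tradeoff mentioned above can be sketched in plain Java without Akka, Mesos, or Fenzo. This is only an illustration under stated assumptions: the class `OfferGate`, its method names, and the one-offer-per-task pairing are hypothetical and are not the PR's code. The timeout parameter is the tuning knob being discussed: a shorter value launches sooner but leaves fewer offers to choose from.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the coordinator's states; plain Java, no Akka or Mesos. */
class OfferGate {
    enum State { SUSPENDED, IDLE, GATHERING_OFFERS }

    // Assumption: a single fixed timeout; shorter means lower latency, fewer offers to pick from.
    private final long gatherTimeoutMillis;
    private final List<String> pendingTasks = new ArrayList<>();
    private final List<String> heldOffers = new ArrayList<>();
    private State state = State.SUSPENDED;
    private long gatherStartedAt;

    OfferGate(long gatherTimeoutMillis) {
        this.gatherTimeoutMillis = gatherTimeoutMillis;
    }

    State state() { return state; }

    /** (Re-)connected: gather offers only if tasks are outstanding. */
    void connected(long now) {
        if (state != State.SUSPENDED) return;
        if (pendingTasks.isEmpty()) {
            state = State.IDLE;
        } else {
            startGathering(now);
        }
    }

    /** Disconnected: offers are no longer valid, but tasks stay queued. */
    void disconnected() {
        heldOffers.clear();
        state = State.SUSPENDED;
    }

    /** A task request moves an idle gate into offer gathering. */
    void launch(String task, long now) {
        pendingTasks.add(task);
        if (state == State.IDLE) startGathering(now);
    }

    /** Returns true if the offer is held for matching, false if it is declined. */
    boolean offer(String offerId) {
        if (state != State.GATHERING_OFFERS) return false;
        heldOffers.add(offerId);
        return true;
    }

    /** Once the timeout elapses, pair held offers with tasks; returns the number launched. */
    int tick(long now) {
        if (state != State.GATHERING_OFFERS || now - gatherStartedAt < gatherTimeoutMillis) return 0;
        int launched = Math.min(pendingTasks.size(), heldOffers.size());
        pendingTasks.subList(0, launched).clear();
        heldOffers.clear();
        if (pendingTasks.isEmpty()) {
            state = State.IDLE;
        } else {
            gatherStartedAt = now; // keep gathering for the remaining tasks
        }
        return launched;
    }

    private void startGathering(long now) {
        state = State.GATHERING_OFFERS;
        gatherStartedAt = now;
    }
}
```

          In this sketch, offers arriving while idle are declined, and a gate built with `new OfferGate(100)` launches nothing until 100 ms after gathering started. The real coordinator delegates the matching step to Fenzo rather than pairing offers and tasks one to one.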

          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75151735

          — Diff: flink-mesos/src/main/scala/org/apache/flink/runtime/clusterframework/ContaineredJobManager.scala —
          @@ -0,0 +1,174 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.runtime.clusterframework
          +
          +import java.util.concurrent.{TimeUnit, ExecutorService}
          +
          +import akka.actor.ActorRef
          +
          +import org.apache.flink.api.common.JobID
          +import org.apache.flink.configuration.{Configuration => FlinkConfiguration, ConfigConstants}
          +import org.apache.flink.runtime.checkpoint.savepoint.SavepointStore
          +import org.apache.flink.runtime.checkpoint.CheckpointRecoveryFactory
          +import org.apache.flink.runtime.clusterframework.ApplicationStatus
          +import org.apache.flink.runtime.executiongraph.restart.RestartStrategyFactory
          +import org.apache.flink.runtime.clusterframework.messages._
          +import org.apache.flink.runtime.jobgraph.JobStatus
          +import org.apache.flink.runtime.jobmanager.{SubmittedJobGraphStore, JobManager}
          +import org.apache.flink.runtime.leaderelection.LeaderElectionService
          +import org.apache.flink.runtime.messages.JobManagerMessages.{RequestJobStatus, CurrentJobStatus, JobNotFound}
          +import org.apache.flink.runtime.messages.Messages.Acknowledge
          +import org.apache.flink.runtime.metrics.{MetricRegistry => FlinkMetricRegistry}
          +import org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager
          +import org.apache.flink.runtime.instance.InstanceManager
          +import org.apache.flink.runtime.jobmanager.scheduler.{Scheduler => FlinkScheduler}
          +
          +import scala.concurrent.duration._
          +import scala.language.postfixOps
          +
          +
          +/** JobManager actor for execution on Yarn or Mesos. It enriches the [[JobManager]] with additional messages
          + * to start/administer/stop the session.
          + *
          + * @param flinkConfiguration Configuration object for the actor
          + * @param executorService Execution context which is used to execute concurrent tasks in the
          + * [[org.apache.flink.runtime.executiongraph.ExecutionGraph]]
          + * @param instanceManager Instance manager to manage the registered
          + * [[org.apache.flink.runtime.taskmanager.TaskManager]]
          + * @param scheduler Scheduler to schedule Flink jobs
          + * @param libraryCacheManager Manager to manage uploaded jar files
          + * @param archive Archive for finished Flink jobs
          + * @param restartStrategyFactory Restart strategy to be used in case of a job recovery
          + * @param timeout Timeout for futures
          + * @param leaderElectionService LeaderElectionService to participate in the leader election
          + */
          +abstract class ContaineredJobManager(
          + flinkConfiguration: FlinkConfiguration,
          + executorService: ExecutorService,
          + instanceManager: InstanceManager,
          + scheduler: FlinkScheduler,
          + libraryCacheManager: BlobLibraryCacheManager,
          + archive: ActorRef,
          + restartStrategyFactory: RestartStrategyFactory,
          + timeout: FiniteDuration,
          + leaderElectionService: LeaderElectionService,
          + submittedJobGraphs : SubmittedJobGraphStore,
          + checkpointRecoveryFactory : CheckpointRecoveryFactory,
          + savepointStore: SavepointStore,
          + jobRecoveryTimeout: FiniteDuration,
          + metricsRegistry: Option[FlinkMetricRegistry])
          + extends JobManager(
          + flinkConfiguration,
          + executorService,
          + instanceManager,
          + scheduler,
          + libraryCacheManager,
          + archive,
          + restartStrategyFactory,
          + timeout,
          + leaderElectionService,
          + submittedJobGraphs,
          + checkpointRecoveryFactory,
          + savepointStore,
          + jobRecoveryTimeout,
          + metricsRegistry) {
          +
          + val jobPollingInterval: FiniteDuration
          +
          + // indicates if this JM has been started in a dedicated (per-job) mode.
          + var stopWhenJobFinished: JobID = null
          +
          + override def handleMessage: Receive = {
          + handleContainerMessage orElse super.handleMessage
          + }
          +
          + def handleContainerMessage: Receive = {
          +
          + case msg @ (_: RegisterInfoMessageListener | _: UnRegisterInfoMessageListener) =>
          + // forward to ResourceManager
          + currentResourceManager match {
          + case Some(rm) =>
          + // we forward the message
          + rm.forward(decorateMessage(msg))
          + case None =>
          + // client has to try again
          + }
          +
          + case msg: ShutdownClusterAfterJob =>
          + val jobId = msg.jobId()
          + log.info(s"ApplicationMaster will shut down session when job $jobId has finished.")
          + stopWhenJobFinished = jobId
          + // trigger regular job status messages (if this is a dedicated/per-job cluster)
          + if (stopWhenJobFinished != null) {
          + context.system.scheduler.schedule(0 seconds,
          — End diff –

          Can't we listen for the `RemoveJob` or `JobStatusChanged` message to get notified when a job has terminated? Then we wouldn't have to poll our own status.
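
          The suggestion above, reacting to a status-change notification instead of scheduling a periodic status request, can be sketched in plain Java. Everything here is hypothetical and only illustrates the pattern: `JobEvents`, its `Status` enum, and `ShutdownOnTermination` are stand-ins, not Flink's `JobStatusChanged` message or `JobStatus` class, and no Akka actors are involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical event source; stands in for whatever publishes job status changes. */
class JobEvents {
    /** Hypothetical status values; not Flink's JobStatus. */
    enum Status {
        RUNNING, FINISHED, FAILED, CANCELED;

        boolean isTerminal() {
            return this != RUNNING;
        }
    }

    private final List<Consumer<Status>> listeners = new ArrayList<>();

    void subscribe(Consumer<Status> listener) {
        listeners.add(listener);
    }

    /** Pushes a status change to every subscriber; nobody has to poll. */
    void publish(Status status) {
        for (Consumer<Status> listener : listeners) {
            listener.accept(status);
        }
    }
}

/** Requests a shutdown the first time a terminal status is observed. */
class ShutdownOnTermination implements Consumer<JobEvents.Status> {
    private boolean shutdownRequested;

    @Override
    public void accept(JobEvents.Status status) {
        if (status.isTerminal()) {
            shutdownRequested = true;
        }
    }

    boolean shutdownRequested() {
        return shutdownRequested;
    }
}
```

          With this shape there is no polling interval to configure, and the shutdown decision happens when the terminal status is published rather than on the next poll.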

          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75150474

          — Diff: flink-mesos/src/main/scala/org/apache/flink/mesos/scheduler/Tasks.scala —
          @@ -0,0 +1,114 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.scheduler
          +
          +import akka.actor.{Actor, ActorRef, Props}
          +import org.apache.flink.configuration.Configuration
          +import org.apache.flink.mesos.scheduler.ReconciliationCoordinator.Reconcile
          +import org.apache.flink.mesos.scheduler.TaskMonitor.{TaskGoalState, TaskGoalStateUpdated, TaskTerminated}
          +import org.apache.flink.mesos.scheduler.Tasks._
          +import org.apache.flink.mesos.scheduler.messages._
          +import org.apache.mesos.{SchedulerDriver, Protos}
          +
          +import scala.collection.mutable.{Map => MutableMap}
          +
          +/**
          + * Aggregate of monitored tasks.
          + *
          + * Routes messages between the scheduler and individual task monitor actors.
          + */
          +class Tasks[M <: TaskMonitor](
          + flinkConfig: Configuration,
          + schedulerDriver: SchedulerDriver,
          + taskMonitorClass: Class[M]) extends Actor {
          +
          + /**
          + * A map of task monitors by task ID.
          + */
          + private val taskMap: MutableMap[Protos.TaskID,ActorRef] = MutableMap()
          +
          + /**
          + * Cache of current connection state.
          + */
          + private var registered: Option[Any] = None
          +
          + override def preStart(): Unit = {
          + // TODO subscribe to context.system.deadLetters for messages to nonexistent tasks
          + }
          +
          + override def receive: Receive = {
          +
          + case msg: Disconnected =>
          + registered = None
          + context.actorSelection("*").tell(msg, self)
          +
          + case msg : Connected =>
          + registered = Some(msg)
          + context.actorSelection("*").tell(msg, self)
          +
          + case msg: TaskGoalStateUpdated =>
          + val taskID = msg.state.taskID
          +
          + // ensure task monitor exists
          + if(!taskMap.contains(taskID)) {
          + val actorRef = createTask(msg.state)
          + registered.foreach(actorRef ! _)
          + }
          +
          + taskMap(taskID) ! msg
          +
          + case msg: StatusUpdate =>
          + taskMap(msg.status().getTaskId) ! msg
          +
          + case msg: Reconcile =>
          + context.parent.forward(msg)
          +
          + case msg: TaskTerminated =>
          + context.parent.forward(msg)
          + }
          +
          + private def createTask(task: TaskGoalState): ActorRef = {
          + val actorProps = TaskMonitor.createActorProps(taskMonitorClass, flinkConfig, schedulerDriver, task)
          — End diff –

          This line is longer than 100 characters, which should cause a checkstyle violation.

          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75149938

          — Diff: flink-mesos/src/main/scala/org/apache/flink/mesos/scheduler/Tasks.scala —
          @@ -0,0 +1,114 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.scheduler
          +
          +import akka.actor.{Actor, ActorRef, Props}
          +import org.apache.flink.configuration.Configuration
          +import org.apache.flink.mesos.scheduler.ReconciliationCoordinator.Reconcile
          +import org.apache.flink.mesos.scheduler.TaskMonitor.{TaskGoalState, TaskGoalStateUpdated, TaskTerminated}
          +import org.apache.flink.mesos.scheduler.Tasks._
          +import org.apache.flink.mesos.scheduler.messages._
          +import org.apache.mesos.{SchedulerDriver, Protos}
          +
          +import scala.collection.mutable.{Map => MutableMap}
          +
          +/**
          + * Aggregate of monitored tasks.
          + *
          + * Routes messages between the scheduler and individual task monitor actors.
          + */
          +class Tasks[M <: TaskMonitor](
          + flinkConfig: Configuration,
          + schedulerDriver: SchedulerDriver,
          + taskMonitorClass: Class[M]) extends Actor {
          +
          + /**
          + * A map of task monitors by task ID.
          + */
          + private val taskMap: MutableMap[Protos.TaskID,ActorRef] = MutableMap()
          +
          + /**
          + * Cache of current connection state.
          + */
          + private var registered: Option[Any] = None
          +
          + override def preStart(): Unit = {
          + // TODO subscribe to context.system.deadLetters for messages to nonexistent tasks
          — End diff –

          Can we resolve this TODO?

          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75149129

          — Diff: flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/store/ZooKeeperMesosWorkerStore.java —
          @@ -0,0 +1,290 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.runtime.clusterframework.store;
          +
          +import org.apache.curator.framework.CuratorFramework;
          +import org.apache.curator.framework.recipes.shared.SharedCount;
          +import org.apache.curator.framework.recipes.shared.SharedValue;
          +import org.apache.curator.framework.recipes.shared.VersionedValue;
          +import org.apache.flink.api.java.tuple.Tuple2;
          +import org.apache.flink.configuration.ConfigConstants;
          +import org.apache.flink.configuration.Configuration;
          +import org.apache.flink.runtime.state.StateHandle;
          +import org.apache.flink.runtime.util.ZooKeeperUtils;
          +import org.apache.flink.runtime.zookeeper.StateStorageHelper;
          +import org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore;
          +import org.apache.mesos.Protos;
          +import org.apache.zookeeper.KeeperException;
          +import org.slf4j.Logger;
          +import org.slf4j.LoggerFactory;
          +import scala.Option;
          +
          +import java.util.ArrayList;
          +import java.util.Collections;
          +import java.util.ConcurrentModificationException;
          +import java.util.List;
          +
          +import static org.apache.flink.util.Preconditions.checkNotNull;
          +import static org.apache.flink.util.Preconditions.checkState;
          +
          +/**
          + * A ZooKeeper-backed Mesos worker store.
          + */
          +public class ZooKeeperMesosWorkerStore implements MesosWorkerStore {
          +
          + private static final Logger LOG = LoggerFactory.getLogger(ZooKeeperMesosWorkerStore.class);
          +
          + private final Object cacheLock = new Object();
          +
          + /** Client (not a namespace facade) */
          + private final CuratorFramework client;
          +
          + /** Flag indicating whether this instance is running. */
          + private boolean isRunning;
          +
          + /** A persistent value of the assigned framework ID */
          + private final SharedValue frameworkIdInZooKeeper;
          +
          + /** A persistent count of all tasks created, for generating unique IDs */
          + private final SharedCount totalTaskCountInZooKeeper;
          +
          + /** A persistent store of serialized workers */
          + private final ZooKeeperStateHandleStore<MesosWorkerStore.Worker> workersInZooKeeper;
          +
          + @SuppressWarnings("unchecked")
          + ZooKeeperMesosWorkerStore(
          + CuratorFramework client,
          + String storePath,
          + StateStorageHelper<MesosWorkerStore.Worker> stateStorage
          + ) throws Exception {
          + checkNotNull(storePath, "storePath");
          + checkNotNull(stateStorage, "stateStorage");
          +
          + // Keep a reference to the original client and not the namespace facade. The namespace
          + // facade cannot be closed.
          + this.client = checkNotNull(client, "client");
          +
          + // All operations will have the given path as root
          + client.newNamespaceAwareEnsurePath(storePath).ensure(client.getZookeeperClient());
          + CuratorFramework facade = client.usingNamespace(client.getNamespace() + storePath);
          +
          + // Track the assignd framework ID.
          + frameworkIdInZooKeeper = new SharedValue(facade, "/frameworkId", new byte[0]);
          +
          + // Keep a count of all tasks created ever, as the basis for a unique ID.
          + totalTaskCountInZooKeeper = new SharedCount(facade, "/count", 0);
          +
          + // Keep track of the workers in state handle storage.
          + facade.newNamespaceAwareEnsurePath("/workers").ensure(client.getZookeeperClient());
          + CuratorFramework storeFacade = client.usingNamespace(facade.getNamespace() + "/workers");
          +
          + this.workersInZooKeeper = ZooKeeperStateHandleStore.class
          + .getConstructor(CuratorFramework.class, StateStorageHelper.class)
          + .newInstance(storeFacade, stateStorage);
          — End diff –

          I think we can resolve this by configuring the shading properly. I could offer to look into this after the merge...
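The constructor quoted above builds the `ZooKeeperStateHandleStore` through `getConstructor(...).newInstance(...)` rather than a plain `new`. A minimal sketch of the two styles follows, assuming the reflection is a workaround for a dependency type that shading relocates; that reading is an assumption, and `Client` and `Store` are hypothetical stand-ins, not Flink or Curator classes.

```java
import java.lang.reflect.Constructor;

public class ReflectiveConstructionSketch {

    /** Hypothetical stand-in for a dependency type that shading might relocate. */
    public static final class Client {
        public final String name;

        public Client(String name) {
            this.name = name;
        }
    }

    /** Hypothetical stand-in for a store that takes the client in its constructor. */
    public static final class Store {
        public final Client client;

        public Store(Client client) {
            this.client = client;
        }
    }

    /** Direct construction: the parameter type is checked at compile time. */
    public static Store direct(Client client) {
        return new Store(client);
    }

    /** Reflective construction: the constructor is resolved at runtime instead. */
    public static Store reflective(Client client) {
        try {
            Constructor<Store> ctor = Store.class.getConstructor(Client.class);
            return ctor.newInstance(client);
        } catch (ReflectiveOperationException e) {
            // A signature mismatch only surfaces here, at runtime.
            throw new IllegalStateException("Could not construct Store reflectively", e);
        }
    }

    public static void main(String[] args) {
        Client client = new Client("zk");
        System.out.println(direct(client).client.name);
        System.out.println(reflective(client).client.name);
    }
}
```

Both paths yield an equivalent object. The reflective one trades compile-time checking for a runtime lookup, which is the cost that a corrected shading setup would remove.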

          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on a diff in the pull request:

          https://github.com/apache/flink/pull/2315#discussion_r75147221

          — Diff: flink-mesos/src/main/scala/org/apache/flink/mesos/scheduler/LaunchCoordinator.scala —
          @@ -0,0 +1,349 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one
          + * or more contributor license agreements. See the NOTICE file
          + * distributed with this work for additional information
          + * regarding copyright ownership. The ASF licenses this file
          + * to you under the Apache License, Version 2.0 (the
          + * "License"); you may not use this file except in compliance
          + * with the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.flink.mesos.scheduler
          +
          +import akka.actor.{Actor, ActorRef, FSM, Props}
          +import com.netflix.fenzo._
          +import com.netflix.fenzo.functions.Action1
          +import com.netflix.fenzo.plugins.VMLeaseObject
          +import grizzled.slf4j.Logger
          +import org.apache.flink.api.java.tuple.{Tuple2=>FlinkTuple2}
          +import org.apache.flink.configuration.Configuration
          +import org.apache.flink.mesos.scheduler.LaunchCoordinator._
          +import org.apache.flink.mesos.scheduler.messages._
          +import org.apache.mesos.Protos.TaskInfo
          +import org.apache.mesos.{SchedulerDriver, Protos}
          +
          +import scala.collection.JavaConverters._
          +import scala.collection.mutable.{Map => MutableMap}
          +import scala.concurrent.duration._
          +
          +/**
          + * The launch coordinator handles offer processing, including
          + * matching offers to tasks and making reservations.
          + *
          + * The coordinator uses Netflix Fenzo to optimize task placement. During the GatheringOffers phase,
          + * offers are evaluated by Fenzo for suitability to the planned tasks. Reservations are then placed
          + * against the best offers, leading to revised offers containing reserved resources with which to launch task(s).
          + */
          +class LaunchCoordinator(
          + manager: ActorRef,
          + config: Configuration,
          + schedulerDriver: SchedulerDriver,
          + optimizerBuilder: TaskSchedulerBuilder
          + ) extends Actor with FSM[TaskState, GatherData] {
          +
          + val LOG = Logger(getClass)
          +
          + /**
          + * The task placement optimizer.
          + *
          + * The optimizer contains the following state:
          + * - unused offers
          + * - existing task placement (for fitness calculation involving task colocation)
          + */
          + private[mesos] val optimizer: TaskScheduler = {
          + optimizerBuilder
          + .withLeaseRejectAction(new Action1[VirtualMachineLease]() {
          + def call(lease: VirtualMachineLease) {
          + LOG.info(s"Declined offer ${lease.getId} from ${lease.hostname()} of ${lease.memoryMB()} MB, ${lease.cpuCores()} cpus.")
          + schedulerDriver.declineOffer(lease.getOffer.getId)
          + }
          + }).build
          + }
          +
          + override def postStop(): Unit = {
          + optimizer.shutdown()
          + super.postStop()
          + }
          +
          + /**
          + * Initial state
          + */
          + startWith(Suspended, GatherData(tasks = Nil, newLeases = Nil))
          +
          + /**
          + * State: Suspended
          + *
          + * Wait for (re-)connection to Mesos. No offers exist in this state, but outstanding tasks might.
          + */
          + when(Suspended) {
          + case Event(msg: Connected, data: GatherData) =>
          + if(data.tasks.nonEmpty) goto(GatheringOffers)
          + else goto(Idle)
          + }
          +
          + /**
          + * State: Idle
          + *
          + * Wait for a task request to arrive, then transition into gathering offers.
          + */
          + onTransition {
          + case _ -> Idle => assert(nextStateData.tasks.isEmpty)
          + }
          +
          + when(Idle) {
          + case Event(msg: Disconnected, data: GatherData) =>
          + goto(Suspended)
          +
          + case Event(offers: ResourceOffers, data: GatherData) =>
          + // decline any offers that come in
          + schedulerDriver.suppressOffers()
          + for(offer <- offers.offers().asScala) { schedulerDriver.declineOffer(offer.getId) }
          + stay()
          +
          + case Event(msg: Launch, data: GatherData) =>
          + goto(GatheringOffers) using data.copy(tasks = data.tasks ++ msg.tasks.asScala)
          + }
          +
          + /**
          + * Transition logic to control the flow of offers.
          + */
          + onTransition {
          + case _ -> GatheringOffers =>
          + LOG.info(s"Now gathering offers for at least ${nextStateData.tasks.length} task(s).")
          + schedulerDriver.reviveOffers()
          +
          + case GatheringOffers -> _ =>
          + // decline any outstanding offers and suppress future offers
          + LOG.info(s"No longer gathering offers; all requests fulfilled.")
          +
          + assert(nextStateData.newLeases.isEmpty)
          + schedulerDriver.suppressOffers()
          + optimizer.expireAllLeases()
          + }
          +
          + /**
          + * State: GatheringOffers
          + *
          + * Wait for offers to accumulate for a fixed length of time or from specific slaves.
          — End diff –

          Why do we wait for offers to accumulate? Why not start tasks as soon as we get offers?
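One way to see the trade-off raised here is to compare launching on the first offer that fits with choosing among a batch of accumulated offers. The sketch below is only an illustration under stated assumptions: `Offer`, `Task`, `firstFit`, `bestFit`, and the leftover-based score are hypothetical and do not represent Fenzo's fitness calculation or the Mesos API.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class OfferBatchingSketch {

    /** Hypothetical, simplified stand-in for a resource offer from one host. */
    public static final class Offer {
        public final String host;
        public final double cpus;
        public final double memMb;

        public Offer(String host, double cpus, double memMb) {
            this.host = host;
            this.cpus = cpus;
            this.memMb = memMb;
        }
    }

    /** Hypothetical, simplified stand-in for a task's resource request. */
    public static final class Task {
        public final String id;
        public final double cpus;
        public final double memMb;

        public Task(String id, double cpus, double memMb) {
            this.id = id;
            this.cpus = cpus;
            this.memMb = memMb;
        }
    }

    static boolean fits(Offer o, Task t) {
        return o.cpus >= t.cpus && o.memMb >= t.memMb;
    }

    /** Fraction of the offer left unused after placing the task (lower is tighter). */
    static double leftover(Offer o, Task t) {
        return (o.cpus - t.cpus) / o.cpus + (o.memMb - t.memMb) / o.memMb;
    }

    /** Launch immediately: take the first offer that fits, in arrival order. */
    public static Optional<Offer> firstFit(List<Offer> offers, Task t) {
        return offers.stream().filter(o -> fits(o, t)).findFirst();
    }

    /** Launch after gathering: pick the tightest fit among the accumulated offers. */
    public static Optional<Offer> bestFit(List<Offer> offers, Task t) {
        return offers.stream()
            .filter(o -> fits(o, t))
            .min(Comparator.comparingDouble((Offer o) -> leftover(o, t)));
    }

    public static void main(String[] args) {
        List<Offer> offers = Arrays.asList(
            new Offer("host-a", 16, 65536),
            new Offer("host-b", 2, 4096));
        Task task = new Task("tm-1", 2, 4096);

        System.out.println("first fit: " + firstFit(offers, task).get().host);
        System.out.println("best fit:  " + bestFit(offers, task).get().host);
    }
}
```

With these two offers, the immediate strategy places the small task on the large host simply because that offer arrived first, while the batched strategy can pick the host that the task fills exactly. The cost of batching is the extra launch latency spent waiting for offers to arrive.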
