Uploaded image for project: 'Apache Twill'
  1. Apache Twill
  2. TWILL-173

Application Master failed with BindException occasionally

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6.0-incubating, 0.7.0-incubating
    • Fix Version/s: 0.8.0
    • Component/s: core, yarn
    • Labels:
      None

      Description

      When the AM starts the embedded Kafka, it first generates a random port (by creating a server socket), followed by provided that port for the Kafka server to bind to. It is possible that after the random port was acquired and before Kafka server bind to it, there is another process on the same box that took that port.

      1. stdout
        65 kB
        Ali Anwar
      2. stderr
        35 kB
        Ali Anwar

        Issue Links

          Activity

          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user chtyim opened a pull request:

          https://github.com/apache/twill/pull/9

          (TWILL-173) Have EmbeddedKafkaServer restart multiple times on bind failure

          • Due to potential race condition between random port generation vs actual binding, there is a possibility that the binding would fail.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/chtyim/twill feature/twill-173

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/twill/pull/9.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #9


          commit ad81f4d73e7161ea676dbb86d2c44b1428f6e976
          Author: Terence Yim <chtyim@apache.org>
          Date: 2016-08-31T21:38:19Z

          (TWILL-173) Have EmbeddedKafkaServer restart multiple times on bind failure

          • Due to potential race condition between random port generation vs actual binding, there is a possibility that the binding would fail.

          Show
          githubbot ASF GitHub Bot added a comment - GitHub user chtyim opened a pull request: https://github.com/apache/twill/pull/9 ( TWILL-173 ) Have EmbeddedKafkaServer restart multiple times on bind failure Due to potential race condition between random port generation vs actual binding, there is a possibility that the binding would fail. You can merge this pull request into a Git repository by running: $ git pull https://github.com/chtyim/twill feature/twill-173 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/twill/pull/9.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9 commit ad81f4d73e7161ea676dbb86d2c44b1428f6e976 Author: Terence Yim <chtyim@apache.org> Date: 2016-08-31T21:38:19Z ( TWILL-173 ) Have EmbeddedKafkaServer restart multiple times on bind failure Due to potential race condition between random port generation vs actual binding, there is a possibility that the binding would fail.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user albertshau commented on a diff in the pull request:

          https://github.com/apache/twill/pull/9#discussion_r77081138

          — Diff: twill-core/src/main/java/org/apache/twill/internal/kafka/EmbeddedKafkaServer.java —
          @@ -65,9 +72,19 @@ protected void startUp() throws Exception {
          if (rootCause instanceof ZkTimeoutException) {
          // Potentially caused by race condition bug described in TWILL-139.
          LOG.warn("Timeout when connecting to ZooKeeper from KafkaServer. Attempt number {}.", tries, rootCause);
          + } else if (rootCause instanceof BindException) {
          + LOG.warn("Kafka failed to bind to port {}. Attempt number {}.", kafkaConfig.port(), tries, rootCause);
          } else

          { throw e; }

          +
          + // Do a random sleep of < 200ms
          + TimeUnit.MILLISECONDS.sleep(new Random().nextInt(200) + 1L);
          +
          + // Generate a new port for the Kafka
          + int port = Networks.getRandomPort();
          + Preconditions.checkState(port > 0, "Failed to get random port.");
          + properties.setProperty("port", Integer.toString(port));
          — End diff –

          so "port" is required to be in the Properties passed to KafkaConfig? Is there any place where a port is being set in the Properties passed to the constructor, and then used somewhere else?

          Show
          githubbot ASF GitHub Bot added a comment - Github user albertshau commented on a diff in the pull request: https://github.com/apache/twill/pull/9#discussion_r77081138 — Diff: twill-core/src/main/java/org/apache/twill/internal/kafka/EmbeddedKafkaServer.java — @@ -65,9 +72,19 @@ protected void startUp() throws Exception { if (rootCause instanceof ZkTimeoutException) { // Potentially caused by race condition bug described in TWILL-139 . LOG.warn("Timeout when connecting to ZooKeeper from KafkaServer. Attempt number {}.", tries, rootCause); + } else if (rootCause instanceof BindException) { + LOG.warn("Kafka failed to bind to port {}. Attempt number {}.", kafkaConfig.port(), tries, rootCause); } else { throw e; } + + // Do a random sleep of < 200ms + TimeUnit.MILLISECONDS.sleep(new Random().nextInt(200) + 1L); + + // Generate a new port for the Kafka + int port = Networks.getRandomPort(); + Preconditions.checkState(port > 0, "Failed to get random port."); + properties.setProperty("port", Integer.toString(port)); — End diff – so "port" is required to be in the Properties passed to KafkaConfig? Is there any place where a port is being set in the Properties passed to the constructor, and then used somewhere else?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user chtyim commented on a diff in the pull request:

          https://github.com/apache/twill/pull/9#discussion_r77082082

          — Diff: twill-core/src/main/java/org/apache/twill/internal/kafka/EmbeddedKafkaServer.java —
          @@ -65,9 +72,19 @@ protected void startUp() throws Exception {
          if (rootCause instanceof ZkTimeoutException) {
          // Potentially caused by race condition bug described in TWILL-139.
          LOG.warn("Timeout when connecting to ZooKeeper from KafkaServer. Attempt number {}.", tries, rootCause);
          + } else if (rootCause instanceof BindException) {
          + LOG.warn("Kafka failed to bind to port {}. Attempt number {}.", kafkaConfig.port(), tries, rootCause);
          } else

          { throw e; }

          +
          + // Do a random sleep of < 200ms
          + TimeUnit.MILLISECONDS.sleep(new Random().nextInt(200) + 1L);
          +
          + // Generate a new port for the Kafka
          + int port = Networks.getRandomPort();
          + Preconditions.checkState(port > 0, "Failed to get random port.");
          + properties.setProperty("port", Integer.toString(port));
          — End diff –

          Currently it's set in the ApplicationMasterService. Probably better to have it left empty (or set to 0) in order to trigger this logic.

          Show
          githubbot ASF GitHub Bot added a comment - Github user chtyim commented on a diff in the pull request: https://github.com/apache/twill/pull/9#discussion_r77082082 — Diff: twill-core/src/main/java/org/apache/twill/internal/kafka/EmbeddedKafkaServer.java — @@ -65,9 +72,19 @@ protected void startUp() throws Exception { if (rootCause instanceof ZkTimeoutException) { // Potentially caused by race condition bug described in TWILL-139 . LOG.warn("Timeout when connecting to ZooKeeper from KafkaServer. Attempt number {}.", tries, rootCause); + } else if (rootCause instanceof BindException) { + LOG.warn("Kafka failed to bind to port {}. Attempt number {}.", kafkaConfig.port(), tries, rootCause); } else { throw e; } + + // Do a random sleep of < 200ms + TimeUnit.MILLISECONDS.sleep(new Random().nextInt(200) + 1L); + + // Generate a new port for the Kafka + int port = Networks.getRandomPort(); + Preconditions.checkState(port > 0, "Failed to get random port."); + properties.setProperty("port", Integer.toString(port)); — End diff – Currently it's set in the ApplicationMasterService. Probably better to have it left empty (or set to 0) in order to trigger this logic.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user albertshau commented on a diff in the pull request:

          https://github.com/apache/twill/pull/9#discussion_r77083133

          — Diff: twill-core/src/main/java/org/apache/twill/internal/kafka/EmbeddedKafkaServer.java —
          @@ -65,9 +72,19 @@ protected void startUp() throws Exception {
          if (rootCause instanceof ZkTimeoutException) {
          // Potentially caused by race condition bug described in TWILL-139.
          LOG.warn("Timeout when connecting to ZooKeeper from KafkaServer. Attempt number {}.", tries, rootCause);
          + } else if (rootCause instanceof BindException) {
          + LOG.warn("Kafka failed to bind to port {}. Attempt number {}.", kafkaConfig.port(), tries, rootCause);
          } else

          { throw e; }

          +
          + // Do a random sleep of < 200ms
          + TimeUnit.MILLISECONDS.sleep(new Random().nextInt(200) + 1L);
          +
          + // Generate a new port for the Kafka
          + int port = Networks.getRandomPort();
          + Preconditions.checkState(port > 0, "Failed to get random port.");
          + properties.setProperty("port", Integer.toString(port));
          — End diff –

          Should we only do this if its originally set to 0 or left empty? Without reading the code I would expect it to connect to the port I set, or not connect at all.

          Or is this not a concern because the port is not used to connect?

          Show
          githubbot ASF GitHub Bot added a comment - Github user albertshau commented on a diff in the pull request: https://github.com/apache/twill/pull/9#discussion_r77083133 — Diff: twill-core/src/main/java/org/apache/twill/internal/kafka/EmbeddedKafkaServer.java — @@ -65,9 +72,19 @@ protected void startUp() throws Exception { if (rootCause instanceof ZkTimeoutException) { // Potentially caused by race condition bug described in TWILL-139 . LOG.warn("Timeout when connecting to ZooKeeper from KafkaServer. Attempt number {}.", tries, rootCause); + } else if (rootCause instanceof BindException) { + LOG.warn("Kafka failed to bind to port {}. Attempt number {}.", kafkaConfig.port(), tries, rootCause); } else { throw e; } + + // Do a random sleep of < 200ms + TimeUnit.MILLISECONDS.sleep(new Random().nextInt(200) + 1L); + + // Generate a new port for the Kafka + int port = Networks.getRandomPort(); + Preconditions.checkState(port > 0, "Failed to get random port."); + properties.setProperty("port", Integer.toString(port)); — End diff – Should we only do this if its originally set to 0 or left empty? Without reading the code I would expect it to connect to the port I set, or not connect at all. Or is this not a concern because the port is not used to connect?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user albertshau commented on the issue:

          https://github.com/apache/twill/pull/9

          lgtm

          Show
          githubbot ASF GitHub Bot added a comment - Github user albertshau commented on the issue: https://github.com/apache/twill/pull/9 lgtm
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/twill/pull/9

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/twill/pull/9
          Hide
          anwar1 Ali Anwar added a comment -

          Terence Yim I don't think this issue is actually fixed. See the stdout and stderr of an ApplicationMaster for a YarnApplication that failed to start up due to this BindException.
          There was a retry implemented, but it seems like its retrying against the same port each time.

          Show
          anwar1 Ali Anwar added a comment - Terence Yim I don't think this issue is actually fixed. See the stdout and stderr of an ApplicationMaster for a YarnApplication that failed to start up due to this BindException. There was a retry implemented, but it seems like its retrying against the same port each time.
          Hide
          chtyim Terence Yim added a comment -

          The line number in the stacktrace doesn't match with the ApplicationMasterMain class. I don't think you are running with the version of Twill that has the fix included.

          Show
          chtyim Terence Yim added a comment - The line number in the stacktrace doesn't match with the ApplicationMasterMain class. I don't think you are running with the version of Twill that has the fix included.

            People

            • Assignee:
              chtyim Terence Yim
              Reporter:
              chtyim Terence Yim
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development