Hadoop YARN / YARN-6643

TestRMFailover fails rarely due to port conflict

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.9.0, 3.0.0-alpha4
    • Fix Version/s: 2.9.0, 3.0.0-alpha4, 2.8.2
    • Component/s: test
    • Labels: None

      Description

      We've seen various tests in TestRMFailover fail very rarely with a message like "org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: ResourceManager failed to start. Final state is STOPPED".

      After some digging, it turns out that it's due to a port conflict with the embedded ZooKeeper in the tests. The embedded ZooKeeper uses ServerSocketUtil#getPort to choose a free port, but the RMs are configured with fixed ports of 10000 + <default-port> and 20000 + <default-port> (e.g. the default port for the RM is 8032, so the two RMs use 18032 and 28032). Nothing prevents ZooKeeper from picking one of those fixed ports first.
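The port scheme described above can be illustrated with a minimal sketch. The class and method names here are hypothetical and are not taken from the Hadoop test code; the zero-based RM index is also an assumption. Only the defaults 8032 and 8033 and the 10000/20000 offsets come from this issue.

```java
// Hypothetical illustration of the fixed-offset port scheme; not the actual HATestUtil code.
final class HaTestPorts {
  static final int RM_DEFAULT_PORT = 8032;        // default RM port (from the description)
  static final int RM_ADMIN_DEFAULT_PORT = 8033;  // default RM Admin port (from the description)

  /** Port for the RM at the given zero-based index: 10000 * (index + 1) + default port. */
  static int portFor(int rmIndex, int defaultPort) {
    return 10000 * (rmIndex + 1) + defaultPort;
  }

  public static void main(String[] args) {
    System.out.println("rm1 RPC:   " + portFor(0, RM_DEFAULT_PORT));
    System.out.println("rm2 RPC:   " + portFor(1, RM_DEFAULT_PORT));
    System.out.println("rm1 admin: " + portFor(0, RM_ADMIN_DEFAULT_PORT));
  }
}
```

Because these values are computed without checking whether the port is actually free, any other process that has already bound one of them (here, the embedded ZooKeeper) causes the RM to fail at startup.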

      When I was able to reproduce this, I saw that ZooKeeper was using port 18033, i.e. 10000 + 8033, where 8033 is the default RM Admin port. The RM then fails to bind its admin port and cannot start, which produces the original error message in the test failure. The log looks like this:

      2017-05-24 01:16:52,735 INFO  service.AbstractService (AbstractService.java:noteFailure(272)) - Service ResourceManager failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [0.0.0.0:18033] java.net.BindException: Address already in use; For more details see:  http://wiki.apache.org/hadoop/BindException
      org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [0.0.0.0:18033] java.net.BindException: Address already in use; For more details see:  http://wiki.apache.org/hadoop/BindException
              at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139)
              at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65)
              at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54)
              at org.apache.hadoop.yarn.server.resourcemanager.AdminService.startServer(AdminService.java:171)
              at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceStart(AdminService.java:158)
              at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
              at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1147)
              at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
              at org.apache.hadoop.yarn.server.MiniYARNCluster$2.run(MiniYARNCluster.java:310)
      Caused by: java.net.BindException: Problem binding to [0.0.0.0:18033] java.net.BindException: Address already in use; For more details see:  http://wiki.apache.org/hadoop/BindException
              at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
              at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
              at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
              at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
              at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
              at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:720)
              at org.apache.hadoop.ipc.Server.bind(Server.java:482)
              at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:688)
              at org.apache.hadoop.ipc.Server.<init>(Server.java:2376)
              at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:1042)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:535)
              at org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:510)
              at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:887)
              at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169)
              at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132)
              ... 9 more
      Caused by: java.net.BindException: Address already in use
              at sun.nio.ch.Net.bind0(Native Method)
              at sun.nio.ch.Net.bind(Net.java:444)
              at sun.nio.ch.Net.bind(Net.java:436)
              at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
              at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
              at org.apache.hadoop.ipc.Server.bind(Server.java:465)
              ... 17 more
      2017-05-24 01:16:52,736 DEBUG service.AbstractService (AbstractService.java:enterState(452)) - Service: ResourceManager entered state STOPPED
      

          Activity

          rkanter Robert Kanter added a comment -

          The 001 patch fixes the problem by using ServerSocketUtil#getPort when setting the RM ports. It still tries the ports chosen by the existing scheme first, but if any of them is busy, it finds a free one instead.

          I was able to verify the fix by forcing ZooKeeper to pick port 18033 and seeing that the tests all pass with the patch but fail without it.
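The "try the preferred port, fall back to a free one" idea can be sketched with plain java.net.ServerSocket. This is only an illustration under stated assumptions: it does not use Hadoop's ServerSocketUtil and is not the code from the patch, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ServerSocket;

// Hypothetical sketch of "prefer a fixed port, fall back to a free one".
final class PortPicker {
  /** Returns true if a listening socket can currently be bound to the port. */
  static boolean isFree(int port) {
    try (ServerSocket probe = new ServerSocket(port)) {
      return true;
    } catch (IOException e) {
      return false; // typically a BindException: address already in use
    }
  }

  /** Returns the preferred port if it is free, otherwise an OS-assigned free port. */
  static int pickPort(int preferred) {
    if (isFree(preferred)) {
      return preferred;
    }
    // Port 0 asks the operating system for any free ephemeral port.
    try (ServerSocket any = new ServerSocket(0)) {
      return any.getLocalPort();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) throws IOException {
    // Occupy a port, then ask for it: the picker should return a different one.
    try (ServerSocket busy = new ServerSocket(0)) {
      int taken = busy.getLocalPort();
      System.out.println("preferred " + taken + " is busy, using " + pickPort(taken));
    }
  }
}
```

Note that a probe like this is inherently racy: the port is released before the real server binds it, so another process could still take it in between. It narrows the window for a conflict rather than eliminating it.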

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote  Subsystem   Runtime  Comment
          0     reexec      0m 22s   Docker mode activated.
          +1    @author     0m 0s    The patch does not contain any @author tags.
          +1    test4tests  0m 0s    The patch appears to include 1 new or modified test files.
          +1    mvninstall  14m 21s  trunk passed
          +1    compile     0m 34s   trunk passed
          +1    checkstyle  0m 25s   trunk passed
          +1    mvnsite     0m 38s   trunk passed
          +1    mvneclipse  0m 18s   trunk passed
          +1    findbugs    1m 1s    trunk passed
          +1    javadoc     0m 22s   trunk passed
          +1    mvninstall  0m 36s   the patch passed
          +1    compile     0m 35s   the patch passed
          +1    javac       0m 35s   the patch passed
          +1    checkstyle  0m 26s   the patch passed
          +1    mvnsite     0m 35s   the patch passed
          +1    mvneclipse  0m 17s   the patch passed
          +1    whitespace  0m 0s    The patch has no whitespace issues.
          +1    findbugs    1m 8s    the patch passed
          +1    javadoc     0m 20s   the patch passed
          -1    unit        42m 59s  hadoop-yarn-server-resourcemanager in the patch failed.
          +1    asflicense  0m 21s   The patch does not generate ASF License warnings.
                            66m 37s



          Reason                 Tests
          Timed out junit tests  org.apache.hadoop.yarn.server.resourcemanager.TestRMStoreCommands
                                 org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA
                                 org.apache.hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA



          Subsystem       Report/Notes
          Docker          Image:yetus/hadoop:14b5c93
          JIRA Issue      YARN-6643
          JIRA Patch URL  https://issues.apache.org/jira/secure/attachment/12869806/YARN-6643.001.patch
          Optional Tests  asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname           Linux ab1e0ee1cf4c 3.13.0-116-generic #163-Ubuntu SMP Fri Mar 31 14:13:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
          Build tool      maven
          Personality     /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision    trunk / d049bd2
          Default Java    1.8.0_131
          findbugs        v3.1.0-RC1
          unit            https://builds.apache.org/job/PreCommit-YARN-Build/16017/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          Test Results    https://builds.apache.org/job/PreCommit-YARN-Build/16017/testReport/
          modules         C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output  https://builds.apache.org/job/PreCommit-YARN-Build/16017/console
          Powered by      Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          jlowe Jason Lowe added a comment -

          +1 lgtm. The unit tests that failed don't even call the code that was changed. I was able to reproduce one of the tests exiting early and filed YARN-6647. I'll commit this later today if there are no objections.

          jlowe Jason Lowe added a comment -

          Thanks, Robert! I committed this to trunk, branch-2, and branch-2.8.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11784 (See https://builds.apache.org/job/Hadoop-trunk-Commit/11784/)
          YARN-6643. TestRMFailover fails rarely due to port conflict. Contributed (jlowe: rev 3fd6a2da4e537423d1462238e10cc9e1f698d1c2)

          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/HATestUtil.java
          rkanter Robert Kanter added a comment -

          Thanks Jason!


            People

            • Assignee: rkanter Robert Kanter
            • Reporter: rkanter Robert Kanter
            • Votes: 0
            • Watchers: 4
