Details

    • Type: Sub-task Sub-task
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 1.6.0
    • Fix Version/s: None
    • Component/s: maven-plugin, mini
    • Labels:
      None
    • Environment:

      Maven 3.0.4 on JDK 6u31 on Jenkins

      Description

      I've seen the following three tests fail multiple times over nightly builds.

      Test Result (3 failures / +3)
      org.apache.accumulo.minicluster.MiniAccumuloClusterTest.org.apache.accumulo.minicluster.MiniAccumuloClusterTest
      org.apache.accumulo.minicluster.impl.MiniAccumuloClusterGCTest.org.apache.accumulo.minicluster.impl.MiniAccumuloClusterGCTest
      org.apache.accumulo.minicluster.impl.MiniAccumuloClusterImplTest.org.apache.accumulo.minicluster.impl.MiniAccumuloClusterImplTest
      

        Issue Links

          Activity

          Hide
          Mike Drob added a comment - - edited

          Clicking through to one of them yields the following exception message:

          Zookeeper did not start within 20 seconds. Check the logs in /var/lib/jenkins/workspace/apache-accumulo-1.6/minicluster/target/mini-tests/org.apache.accumulo.minicluster.MiniAccumuloClusterTest/logs for errors.  Last exception: java.net.ConnectException: Connection refused
          

          Going to that directory, I find an empty ZooKeeperServerMain_1480462011.err and a single line in the .out file:

          Listening for transport dt_socket at address: 33078
          

          The port changes between runs, but I'm fairly sure that is expected. However, looking at conf/zoo.cfg I see:

          #Wed Mar 26 05:38:18 PDT 2014
          maxClientCnxns=1000
          clientPort=26281
          dataDir=/var/lib/jenkins/workspace/apache-accumulo-1.6/minicluster/target/mini-tests/org.apache.accumulo.minicluster.MiniAccumuloClusterTest/zookeeper
          syncLimit=5
          initLimit=10
          tickTime=2000
          

          So for some reason the ports don't match up.

          Show
          Mike Drob added a comment - - edited Clicking through to one of them yields the following exception message: Zookeeper did not start within 20 seconds. Check the logs in /var/lib/jenkins/workspace/apache-accumulo-1.6/minicluster/target/mini-tests/org.apache.accumulo.minicluster.MiniAccumuloClusterTest/logs for errors. Last exception: java.net.ConnectException: Connection refused Going to that directory, I find an empty ZooKeeperServerMain_1480462011.err and a single line in the .out file: Listening for transport dt_socket at address: 33078 The port changes between runs, but I'm fairly sure that is expected. However, looking at conf/zoo.cfg I see: #Wed Mar 26 05:38:18 PDT 2014 maxClientCnxns=1000 clientPort=26281 dataDir=/var/lib/jenkins/workspace/apache-accumulo-1.6/minicluster/target/mini-tests/org.apache.accumulo.minicluster.MiniAccumuloClusterTest/zookeeper syncLimit=5 initLimit=10 tickTime=2000 So for some reason the ports don't match up.
          Hide
          Sean Busbey added a comment -

          I also have this failing the maven-plugin module

          Show
          Sean Busbey added a comment - I also have this failing the maven-plugin module
          Hide
          Christopher Tubbs added a comment -

          I am unable to reproduce this with Maven 3.0.4 or 3.1.1. Was there something else running on the port? Which OS is this failing on? Have you tried with 3.1.1 or newer Maven? Was this build running concurrently with another build that may have caused a port conflict? Was mvn executed with -T to do parallel builds across modules that both use mini?

          Show
          Christopher Tubbs added a comment - I am unable to reproduce this with Maven 3.0.4 or 3.1.1. Was there something else running on the port? Which OS is this failing on? Have you tried with 3.1.1 or newer Maven? Was this build running concurrently with another build that may have caused a port conflict? Was mvn executed with -T to do parallel builds across modules that both use mini?
          Hide
          Sean Busbey added a comment -

          In my case this is on a Jenkins box.

          • the run was on a CentOS 6 box.
          • Jenkins is configured to only allow one Accumulo job on that box at a time, so it shouldn't have anything concurrent.
          • This was after I stopped doing parallel builds because of IT failures (although that had been parallel with the IT forking arg and not -T)
          • JDK6u31
          • Maven 3.0.4

          I can try to get a newer Maven set up, but it will probably take some time. I should be able to rely on 3.0.4 right?

          Show
          Sean Busbey added a comment - In my case this is on a Jenkins box. the run was on a CentOS 6 box. Jenkins is configured to only allow one Accumulo job on that box at a time, so it shouldn't have anything concurrent. This was after I stopped doing parallel builds because of IT failures (although that had been parallel with the IT forking arg and not -T) JDK6u31 Maven 3.0.4 I can try to get a newer Maven set up, but it will probably take some time. I should be able to rely on 3.0.4 right?
          Hide
          Christopher Tubbs added a comment -

          There shouldn't be any issues with 3.0.4. I guess this is just an issue that plagues mini in general (not being able to reserve a ZK port). I'm not sure if there's a good fix. How frequently do these happen for you? And, do you see any extra failures with the maven-plugin specifically on this (e.g. more reproducible in that module than the test module ITs)?

          Show
          Christopher Tubbs added a comment - There shouldn't be any issues with 3.0.4. I guess this is just an issue that plagues mini in general (not being able to reserve a ZK port). I'm not sure if there's a good fix. How frequently do these happen for you? And, do you see any extra failures with the maven-plugin specifically on this (e.g. more reproducible in that module than the test module ITs)?
          Hide
          Mike Drob added a comment -

          These failures happen approximately 50% of the time on our Jenkins.

          I have two ideas for how we can resolve this, but both seem a little heavy-handed.

          1. We could switch from starting a full separate ZK to using an embedded one in the process. There are several ways to do this, some of which would require us to bump up our dependency from 3.3 to 3.4
          2. We could provide configuration options to allow for using an existing ZK.

          Assuming that nobody figures out what this error actually is, I'd prefer the first solution but can see the merits of both.

          Show
          Mike Drob added a comment - These failures happen approximately 50% of the time on our Jenkins. I have two ideas for how we can resolve this, but both seem a little heavy-handed. We could switch from starting a full separate ZK to using an embedded one in the process. There are several ways to do this , some of which would require us to bump up our dependency from 3.3 to 3.4 We could provide configuration options to allow for using an existing ZK. Assuming that nobody figures out what this error actually is, I'd prefer the first solution but can see the merits of both.
          Hide
          Josh Elser added a comment -

          I don't recall ever seeing this either. I guess you have your work cut out to figure out what's happening

          Show
          Josh Elser added a comment - I don't recall ever seeing this either. I guess you have your work cut out to figure out what's happening
          Hide
          Keith Turner added a comment -

          I picked a random MAC dir from a 1.6.0-SNAPSHOT run yesterday.

          Looking at the following file :

          /local/disk1/jenkins/workspace/accumulo16/test/target/mini-tests/org.apache.accumulo.test.DumpConfigIT_test/logs/ZooKeeperServerMain_1656483996.out
          

          I see

          2014-04-16 17:10:10,077 [quorum.QuorumPeerConfig] INFO : Reading configuration from: /local/disk1/jenkins/workspace/accumulo16/test/target/mini-tests/org.apache.accumulo.test.DumpConfigIT_test/conf/zoo.cfg
          2014-04-16 17:10:10,078 [server.ZooKeeperServerMain] INFO : Starting server
          

          then I see

          2014-04-16 17:10:10,135 [server.NIOServerCnxn] INFO : binding to port 0.0.0.0/0.0.0.0:59228
          

          There are no lines in the log containing the string "dt_socket". I grepped every zookeeper out file for that string and did not see it.

          In the past DNS was flaky on my Jenkins machine and for some reason zookeeper would hang. I jstack the zookeeper server process and saw it was hung in java name resolution calls during initialization. Also I know there was very little in the zk out when this happened, but I can not remember exactly what. Have you been able to jstack?

          Show
          Keith Turner added a comment - I picked a random MAC dir from a 1.6.0-SNAPSHOT run yesterday. Looking at the following file : /local/disk1/jenkins/workspace/accumulo16/test/target/mini-tests/org.apache.accumulo.test.DumpConfigIT_test/logs/ZooKeeperServerMain_1656483996.out I see 2014-04-16 17:10:10,077 [quorum.QuorumPeerConfig] INFO : Reading configuration from: /local/disk1/jenkins/workspace/accumulo16/test/target/mini-tests/org.apache.accumulo.test.DumpConfigIT_test/conf/zoo.cfg 2014-04-16 17:10:10,078 [server.ZooKeeperServerMain] INFO : Starting server then I see 2014-04-16 17:10:10,135 [server.NIOServerCnxn] INFO : binding to port 0.0.0.0/0.0.0.0:59228 There are no lines in the log containing the string "dt_socket". I grepped every zookeeper out file for that string and did not see it. In the past DNS was flaky on my Jenkins machine and for some reason zookeeper would hang. I jstack the zookeeper server process and saw it was hung in java name resolution calls during initialization. Also I know there was very little in the zk out when this happened, but I can not remember exactly what. Have you been able to jstack?
          Hide
          Mike Drob added a comment -

          These jobs are only failing on a Jenkins instance that I do not have shell access to, so I cannot jstack the processes.

          Show
          Mike Drob added a comment - These jobs are only failing on a Jenkins instance that I do not have shell access to, so I cannot jstack the processes.
          Hide
          Keith Turner added a comment -

          Earlier when I looked for dt_socket I was looking in test/target/mini-test and found nothing. I tried looking in minicluster/target/mini-test and did find dt_socket messages.

          $ pwd
          /local/disk1/jenkins/workspace/accumulo16/minicluster/target/mini-tests
          $ find . -name "*Zoo*.out" | xargs grep "dt_socket"
          ./org.apache.accumulo.minicluster.impl.MiniAccumuloClusterImplTest/logs/ZooKeeperServerMain_1826463271.out:Listening for transport dt_socket at address: 37923
          ./org.apache.accumulo.minicluster.MiniAccumuloClusterTest/logs/ZooKeeperServerMain_523145999.out:Listening for transport dt_socket at address: 48270
          

          Looking into this further, this dt_socket message is not the zookeeper port.

          $ cd ./org.apache.accumulo.minicluster.impl.MiniAccumuloClusterImplTest/logs/
          $ grep dt_socket ZooKeeperServerMain_1826463271.out 
          Listening for transport dt_socket at address: 37923
          $ grep inding ZooKeeperServerMain_1826463271.out 
          2014-04-16 15:30:26,212 [server.NIOServerCnxn] INFO : binding to port 0.0.0.0/0.0.0.0:2617
          $ grep 2617 ../conf/zoo.cfg 
          clientPort=2617
          

          Maybe the dt_socket port is a debug or jmx port??? Whatever it is, its the very first think printed and if you see nothing else, then it implies zookeeper is hung. I think it would be really useful to jstack one of these hung zookeeper servers if you can catch it.

          Show
          Keith Turner added a comment - Earlier when I looked for dt_socket I was looking in test/target/mini-test and found nothing. I tried looking in minicluster/target/mini-test and did find dt_socket messages. $ pwd /local/disk1/jenkins/workspace/accumulo16/minicluster/target/mini-tests $ find . -name "*Zoo*.out" | xargs grep "dt_socket" ./org.apache.accumulo.minicluster.impl.MiniAccumuloClusterImplTest/logs/ZooKeeperServerMain_1826463271.out:Listening for transport dt_socket at address: 37923 ./org.apache.accumulo.minicluster.MiniAccumuloClusterTest/logs/ZooKeeperServerMain_523145999.out:Listening for transport dt_socket at address: 48270 Looking into this further, this dt_socket message is not the zookeeper port. $ cd ./org.apache.accumulo.minicluster.impl.MiniAccumuloClusterImplTest/logs/ $ grep dt_socket ZooKeeperServerMain_1826463271.out Listening for transport dt_socket at address: 37923 $ grep inding ZooKeeperServerMain_1826463271.out 2014-04-16 15:30:26,212 [server.NIOServerCnxn] INFO : binding to port 0.0.0.0/0.0.0.0:2617 $ grep 2617 ../conf/zoo.cfg clientPort=2617 Maybe the dt_socket port is a debug or jmx port??? Whatever it is, its the very first think printed and if you see nothing else, then it implies zookeeper is hung. I think it would be really useful to jstack one of these hung zookeeper servers if you can catch it.
          Hide
          Keith Turner added a comment -

          Googlin Listening for transport dt_socket at address seems to indicate that message is related to enabling JDWP. Those two test that had the message call setJDWPEnabled(true). So maybe you could connect to the debug port an get a jstack

          Show
          Keith Turner added a comment - Googlin Listening for transport dt_socket at address seems to indicate that message is related to enabling JDWP. Those two test that had the message call setJDWPEnabled(true) . So maybe you could connect to the debug port an get a jstack
          Hide
          Keith Turner added a comment -

          Re starting a zookeeper in process. This might be cool, especially if it sped up MAC. However it may not fix this issue, if zookeeper is going to hang on something like a bad DNS config then it may not matter if its in process. It would certainly be worth trying.

          Show
          Keith Turner added a comment - Re starting a zookeeper in process. This might be cool, especially if it sped up MAC. However it may not fix this issue, if zookeeper is going to hang on something like a bad DNS config then it may not matter if its in process. It would certainly be worth trying.
          Hide
          Keith Turner added a comment -

          I just remembered something. When I was seeing ZK server hang because of DNS issues, if I gave more time for ZK to start then something would timeout and the ZK server would eventually start.

          Show
          Keith Turner added a comment - I just remembered something. When I was seeing ZK server hang because of DNS issues, if I gave more time for ZK to start then something would timeout and the ZK server would eventually start.
          Hide
          Mike Drob added a comment -

          Just did some log gathering on this. Out of my previous 167 builds, 20 have been unstable for this reason. So, not the 50% I originally claimed, but more like 15%.

          All 20 test failures only occurred on one slave, so it might be something up with that slave's configuration. 26 of the successful builds have been on that same host.

          Show
          Mike Drob added a comment - Just did some log gathering on this. Out of my previous 167 builds, 20 have been unstable for this reason. So, not the 50% I originally claimed, but more like 15%. All 20 test failures only occurred on one slave, so it might be something up with that slave's configuration. 26 of the successful builds have been on that same host.
          Hide
          Keith Turner added a comment -

          I asked Eric Newton about removing the failure when zookeeper does not start in a certain amount of time. He had two concerns :

          1. That it will make this type of issue harder to debug because you will not know why a test timed out. He said this could be handled by making MAC log a message still waiting for ZK to start, instead of throwing an exception.
          2. He mention that MAC is started in @Before and @BeforeClass methods which junit may not time out.

          Show
          Keith Turner added a comment - I asked Eric Newton about removing the failure when zookeeper does not start in a certain amount of time. He had two concerns : 1. That it will make this type of issue harder to debug because you will not know why a test timed out. He said this could be handled by making MAC log a message still waiting for ZK to start , instead of throwing an exception. 2. He mention that MAC is started in @Before and @BeforeClass methods which junit may not time out.
          Hide
          Eric Newton added a comment -

          Has not failed for days.

          Show
          Eric Newton added a comment - Has not failed for days.

            People

            • Assignee:
              Unassigned
              Reporter:
              Mike Drob
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development