DERBY-4319

hang in suites.all with ibm 1.5 on AIX after ttestDefaultProperties

    Details

    • Urgency:
      Urgent
    • Issue & fix info:
      High Value Fix
    • Bug behavior facts:
      Regression Test Failure

      Description

      The test run for 10.5.2.0 hung in suites.All. The console output (the run was with -Dderby.tests.trace=true) showed ttestDefaultProperties had successfully completed but the run was halted.
      ps -eaf | grep java showed the process that kicked off suites.All, and a networkserver process with the following flags:

      -classpath <classpath including derby.jar, derbytools.jar, derbyclient.jar, derbynet.jar, derbyTesting.jar, derbyrun.jar, derbyTesting.jar and junit.jar> -Dderby.drda.logConnections= -Dderby.drda.traceAll= -Dderby.drda.traceDirectory= -Dderby.drda.keepAlive= -Dderby.drda.timeSlice= -Dderby.drda.host= -Dderby.drda.portNumber= -derby.drda.minThreads= -Dderby.drda.maxThreads= -Dderby.drda.startNetworkServer= -Dderby.drda.debug= org.apache.derby.drda.NetworkServerControl start -h localhost -p 1527
        This process had been sitting for 2 days.
        After killing the NetworkServerControl process, the test continued successfully (except for DERBY-4186, fixed in trunk), but the following was put out to the console:
        START-SPAWNED:SpawnedNetworkServer STANDARD OUTPUT: exit code=137
        2009-07-18 03:16:07.157 GMT : Security manager installed using the Basic server
        security policy.
        2009-07-18 03:16:09.169 GMT : Apache Derby Network Server - 10.5.2.0 - (794445)
        started and ready to accept connections on port 1527
        END-SPAWNED :SpawnedNetworkServer STANDARD OUTPUT:
      1. TestProcess.javacore.20110310.123703.4390978.0001.txt
        367 kB
        Kathey Marsden
      2. TestOutput2011-03-09.txt
        10 kB
        Kathey Marsden
      3. LaunchedNetworkServerAfterPing.javacore.20110310.124948.6488248.0002.txt
        139 kB
        Kathey Marsden
      4. LaunchedNetworkServer.javacore.20110309.160148.6488248.0001.txt
        139 kB
        Kathey Marsden
      5. javacore.20090723.093909.24726.0001.txt
        131 kB
        Myrna van Lunteren
      6. javacore.20090723.093837.25380.0001.txt
        131 kB
        Myrna van Lunteren
      7. derby-4319_teardown_kill_on_bad_ping.txt
        2 kB
        Kathey Marsden
      8. derby-4319_disableSetPortPriorityAndDefaultProperties_diff.txt
        5 kB
        Kathey Marsden
      9. derby-4319_disable_setPortPriorty_diff.txt
        4 kB
        Kathey Marsden
      10. derby-4317_timeout_for_complete_diff.txt
        3 kB
        Kathey Marsden

          Activity

          Myrna van Lunteren added a comment -

          system/derby.log at the time of the hang had 1 start (on port 1531) and 3 shutdowns.
          2009-07-18 03:15:51.865 GMT : Apache Derby Network Server - 10.5.2.0 - (794445)
          started and ready to accept connections on port 1531
          2009-07-18 03:15:58.988 GMT : Apache Derby Network Server - 10.5.2.0 - (794445)
          shutdown
          2009-07-18 03:15:59.101 GMT : Apache Derby Network Server - 10.5.2.0 - (794445)
          shutdown
          2009-07-18 03:15:59.257 GMT : Apache Derby Network Server - 10.5.2.0 - (794445)
          shutdown

          logs/serverConsoleOutput.log had this:
          2009-07-18 03:15:03.669 GMT : Could not connect to Derby Network Server on host
          127.0.0.1, port 1528: A remote host refused an attempted connect operation.
          2009-07-18 03:15:04.177 GMT : Apache Derby Network Server - 10.5.2.0 - (794445)
          started and ready to accept connections on port 1528
          2009-07-18 03:15:04.259 GMT : Apache Derby Network Server - 10.5.2.0 - (794445)
          shutdown
          2009-07-18 03:15:04.361 GMT : Could not connect to Derby Network Server on host
          127.0.0.1, port 1528: A remote host refused an attempted connect operation.
          2009-07-18 03:15:04.445 GMT : Could not connect to Derby Network Server on host
          127.0.0.1, port 1527: A remote host refused an attempted connect operation.
          2009-07-18 03:15:04.548 GMT : Could not connect to Derby Network Server on host
          127.0.0.1, port 1527: A remote host refused an attempted connect operation.
          2009-07-18 03:15:04.658 GMT : Could not connect to Derby Network Server on host
          127.0.0.1, port 1527: A remote host refused an attempted connect operation.
          2009-07-18 03:15:04.768 GMT : Could not connect to Derby Network Server on host
          127.0.0.1, port 1527: A remote host refused an attempted connect operation.
          2009-07-18 03:15:04.848 GMT : Apache Derby Network Server - 10.5.2.0 - (794445)
          started and ready to accept connections on port 1527
          2009-07-18 03:15:40.465 GMT : Apache Derby Network Server - 10.5.2.0 - (794445)
          shutdown

          I don't find any further helpful info.

          Myrna van Lunteren added a comment -

          This reproduces when I just run ServerPropertiesTest.
          The log/serverConsoleOutput.log must be from a different test - when I just run the ServerPropertiesTest it doesn't get created.
          I'll do some more experiments (newer jvm, redo 10.5.1.1, insane, latest 10.5 build) and report.

          Kathey Marsden added a comment -

          It would be useful to see the thread dump to see if this might be another shutdown issue or something different.

          Myrna van Lunteren added a comment -

          I went back over my 10.5.1 testing notes, and I have seen this before - with 10.5.1.0 with IBM 1.6 SR4 (and I thought it was a jvm issue as I didn't see it that time with IBM 1.5); then with 10.5.1.1 I saw it with IBM 1.5 and IBM 1.6 SR3, but I couldn't reproduce anymore after I rebooted the machine...

          Myrna van Lunteren added a comment -

          Attaching the javacore files obtained with kill -QUIT while the test was hung (running just the one test).
          25380 was the pid of the NetworkServerControl start process;
          24726 was the pid of the junit.textui.TestRunner ...ServerPropertiesTest process.

          Myrna van Lunteren added a comment -

          I ran the test with 10.4.2.0 and IBM 1.5, and it hung too!
          Then I couldn't log on anymore, so I rebooted the machine, and now there is no more hang.

          I'd blame it on the machine but it's odd that it's only in this particular test...
          No one else works on this machine, there are no long-running processes...

          Myrna van Lunteren added a comment -

          I did not see this during the 10.5.3.0 test run.
          (doesn't mean it's gone, I don't think there was a related change between 10.5.2.0 and 10.5.3.0).

          Myrna van Lunteren added a comment -

          I've tried to get this error, but it's not reproducing for me.
          If I see it again, I can reopen and attempt further research.

          Myrna van Lunteren added a comment -

          I saw this while testing 10.6.2.1, insane jars, once with ibm15 (SR12, FP1) (passed second run) and once with ibm16 (SR8, FP1), running suites.All with -Xmx512M -Xms512M (and -Dderby.tests.trace=true).

          I'll try to gather more info.

          Kathey Marsden added a comment -

          Triage for 10.8. Marking urgent as it is generally not good to have hanging tests that prevent the rest of the test run from completing.

          Kathey Marsden added a comment -

          I ran the test and saw it hang on AIX.
          When I then tried to ping the server manually once it had hung, I saw a connection reset error:

          $ java org.apache.derby.drda.NetworkServerControl ping
          Mon Feb 28 14:52:49 PST 2011 : Error on client socket:
          Connection reset
          Mon Feb 28 14:52:49 PST 2011 : Connection reset
          java.net.SocketException: Connection reset
          at java.net.SocketInputStream.read(SocketInputStream.java:197)
          at java.net.SocketInputStream.read(SocketInputStream.java:116)
          at org.apache.derby.impl.drda.NetworkServerControlImpl.fillReplyBuffer(NetworkServerControlImpl.java:2873)
          at org.apache.derby.impl.drda.NetworkServerControlImpl.readResult(NetworkServerControlImpl.java:2817)
          at org.apache.derby.impl.drda.NetworkServerControlImpl.pingWithNoOpen(NetworkServerControlImpl.java:1253)
          at org.apache.derby.impl.drda.NetworkServerControlImpl.ping(NetworkServerControlImpl.java:1228)
          at org.apache.derby.impl.drda.NetworkServerControlImpl.executeWork(NetworkServerControlImpl.java:2260)
          at org.apache.derby.drda.NetworkServerControl.main(NetworkServerControl.java:320)

          After that occurred, any attempt to ping or shut down the server hung, and looking at the javacore, the ClientThread was no longer running.

          Since the original trace from the hang was in SpawnedProcess.complete(), called from NetworkServerTestSetup.tearDown(), this is what I think happened:

          In NetworkServerTestSetup.tearDown we have
          if (networkServerController != null) {
              boolean running = false;
              try {
                  networkServerController.ping();
                  running = true;
              } catch (Exception e) {
              }

          Assuming the ping returned the connection reset, teardown concluded that the server was down even though the process was still running. It did not attempt a shutdown, and it called spawnedServer.complete(failedShutdown != null); with false as its argument, so it did not try to destroy the process either. It therefore remained hung waiting for the process to exit.

          There seem to be a few issues here.
          1) How does the server get into this state for this particular test?
          2) How can we ensure that the server is brought down or destroyed no matter what?

          I think I will focus on the second aspect first, so we don't have the risk of full runs getting held up and then try to understand the root cause for the network server state after that.

          Kathey Marsden added a comment -

          The attached patch changes NetworkServerTestSetup.tearDown() to print the stack trace and destroy the server process if the test ping fails with some error other than the standard Cannot Connect error that happens if the server is not up. Before the patch, if the ping failed for any reason, the assumption was that the server was down.

          This doesn't eliminate the possibility of a hang entirely: if the test ping itself were to hang, we would still have the problem.

          Note this patch does not get at the root cause of the Connection reset error on AIX, but will hopefully get the tests to finish their run. Currently, if the process destruction is required, it won't result in the failure of a test, just the printing of the stack trace to System.out.
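
          The teardown decision described in this comment can be sketched roughly as follows. This is a minimal illustration, not the attached patch: PingOutcome, Pinger, classify and mustDestroy are hypothetical names, and java.net.ConnectException is used here only as a stand-in for Derby's "cannot connect" error, whose exact form is not shown in this thread.

```java
import java.net.ConnectException;

/** Hypothetical helper for illustration; not part of Derby's test framework. */
final class PingOutcome {
    enum State { RUNNING, NOT_RUNNING, UNKNOWN }

    /** Stand-in for a call such as networkServerController.ping(). */
    interface Pinger {
        void ping() throws Exception;
    }

    /** Distinguishes "server is down" from "ping failed for some other reason". */
    static State classify(Pinger pinger) {
        try {
            pinger.ping();
            return State.RUNNING;
        } catch (ConnectException e) {
            // Assumption: a refused connection means nothing is listening.
            return State.NOT_RUNNING;
        } catch (Exception e) {
            // Any other failure (for example "Connection reset") does not
            // prove the process is gone, so report it and treat it as unknown.
            e.printStackTrace(System.out);
            return State.UNKNOWN;
        }
    }

    /** Only an unexplained ping failure calls for destroying the process. */
    static boolean mustDestroy(State state) {
        return state == State.UNKNOWN;
    }
}
```

          With this split, a Connection reset no longer looks the same as a server that was never started, which is the confusion that left the spawned process alive.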

          Kathey Marsden added a comment -

          Well, this patch isn't quite right yet for the German tests, but I'll fix that up.

          Kathey Marsden added a comment -

          Well, I want to try to get something checked in to get the tests to complete on AIX before we cut the release candidate. I believe the attached patch, which adds a timeout to the SpawnedProcess.complete() method, should do that and should do no harm. I am still not totally happy with it, and unfortunately the AIX machine does not feel like hanging right now. I am running suites.All on Windows and AIX now and will check in before tomorrow morning's RC if all goes well.
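
          The idea of bounding the wait for a spawned process can be sketched as below. This is not the committed change: BoundedWait and completeOrDestroy are hypothetical names, and the sketch relies on Process.waitFor(long, TimeUnit) and destroyForcibly(), which exist only in Java 8 and later, so the actual patch for the JVMs discussed here must have been written differently.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for illustration; not Derby's SpawnedProcess. */
final class BoundedWait {
    /**
     * Waits up to timeoutMillis for the process to exit on its own.
     * If it does not, the process is destroyed so the caller cannot hang.
     *
     * @return true if the process exited by itself, false if it was destroyed
     */
    static boolean completeOrDestroy(Process process, long timeoutMillis) {
        try {
            if (process.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return true;
            }
        } catch (InterruptedException e) {
            // Preserve the interrupt for the caller and fall through to destroy.
            Thread.currentThread().interrupt();
        }
        process.destroyForcibly();
        return false;
    }
}
```

          A caller would pick a timeout comfortably longer than a normal server shutdown, so that a healthy run never reaches the destroy branch.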

          Kathey Marsden added a comment -

          I got an out of memory error on the aix run. I don't know that it is related to this change but won't chance it. I am sorry it won't make it by tomorrow morning.

          Kathey Marsden added a comment -

          It turned out that AIX has always required -Xmx512M, which I added, and I got through the run. I did two runs on AIX and two on Windows, getting different known intermittent failures each time, but none that I think are related to this change. Committed revision 1079548.

          Kathey Marsden added a comment -

          There was some network maintenance that seems to be what got the AIX machine out of its problematic state. Myrna said that in the past, after a reboot (this time it wasn't a reboot of the machine but of the router), it would work for a while and then start to fail consistently. I kicked off fifty runs of the derbynet suite to see if I can get the machine back into the problematic state.

          Kathey Marsden added a comment -

          I have the machine back in the state where this reproduces and am sorry to say that there is still a hang in a different method, even with my prior attempt to get past it, but since I can reproduce now, I should be able to make some progress on this issue. I'll record some info here in case it becomes hard to reproduce again.

          The current state of the hang: the launched network server process, which seems to specify all the drda parameters without values, is
          cloudtst 6488248 4390978 0 14:41:38 - 0:20 /local1/IBM_JDK/15sr13/sdk/jre/bin/java -classpath /local1/kmarsden/repro/derby-4319/jars//derby.jar:/local1/kmarsden/repro/derby-4319/jars//derbyrun.jar:/local1/kmarsden/repro/derby-4319/jars//derbyTesting.jar:/local1/kmarsden/repro/derby-4319/jars//junit.jar -Dderby.drda.logConnections= -Dderby.drda.traceAll= -Dderby.drda.traceDirectory= -Dderby.drda.keepAlive= -Dderby.drda.timeSlice= -Dderby.drda.host= -Dderby.drda.portNumber= -Dderby.drda.minThreads= -Dderby.drda.maxThreads= -Dderby.drda.startNetworkServer= -Dderby.drda.debug= org.apache.derby.drda.NetworkServerControl start -h localhost -p 1527

          I will attach the javacore with thread dump as LaunchedNetworkServer.javacore.20110309.160148.6488248.0001.txt

          The server threads look pretty normal with a ClientThread running waiting to accept requests.

          The test process is hung in NetworkServerTestSetup.complete(). I am not sure if it is later or if the change I made just did not work. I will attach the test process file as:
          TestProcess.javacore.20110310.123703.4390978.0001.txt

          If I try to ping the server from the command line I get a ConnectionReset error:
          $ java org.apache.derby.drda.NetworkServerControl ping
          Thu Mar 10 12:47:39 PST 2011 : Error on client socket:
          Connection reset
          Thu Mar 10 12:47:39 PST 2011 : Connection reset
          java.net.SocketException: Connection reset
          at java.net.SocketInputStream.read(SocketInputStream.java:197)
          at java.net.SocketInputStream.read(SocketInputStream.java:116)
          at org.apache.derby.impl.drda.NetworkServerControlImpl.fillReplyBuffer(NetworkServerControlImpl.java:2873)
          at org.apache.derby.impl.drda.NetworkServerControlImpl.readResult(NetworkServerControlImpl.java:2817)
          at org.apache.derby.impl.drda.NetworkServerControlImpl.pingWithNoOpen(NetworkServerControlImpl.java:1253)
          at org.apache.derby.impl.drda.NetworkServerControlImpl.ping(NetworkServerControlImpl.java:1228)
          at org.apache.derby.impl.drda.NetworkServerControlImpl.executeWork(NetworkServerControlImpl.java:2260)
          at org.apache.derby.drda.NetworkServerControl.main(NetworkServerControl.java:320)

          Then after that, subsequent ping attempts hang, and a new thread dump of the Network Server process shows that the ClientThread is no longer there. I think this should never happen. I think a lot of work has been put into making sure that the ClientThread always survives any type of error in order to host more connections. See attachment LaunchedNetworkServerAfterPing.javacore.20110310.124948.6488248.0002.txt

          Another thing to note is that prior to the defaultProperties test there was actually a stack trace in the setPortPriority test with a Connection reset which did not cause failure. See TestOutput2011-03-09.txt.

          This issue actually has many facets that are worth working on:

          1) How do we make sure a spawned network server process is destroyed if it hangs the whole suite?

          2) Under what circumstances can the Network Server ClientThread that loops accepting new connections be destroyed?

          3) What sort of problem is being caused on AIX by starting network server with these odd options? I am thinking maybe it is related to soTimeout or keepalive getting set to an unexpected option but am not sure.

          I have been holding off on working on 3, because it provides a good reproduction for issues one and two, but I think that at this point the best thing to do would be to disable the problematic fixture on AIX, whether it is testSetPortPriority or testDefaultProperties. Then I can work on all three issues in a logical order and pace without release concerns. I'll look into doing that.
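
          For facet 3, one JVM-level fact is certain: a flag such as -Dderby.drda.portNumber= defines the system property with an empty string as its value, not null. How the server treats such values is not established in this thread. A minimal sketch of reading a numeric property defensively, where EmptyPropertyDemo and intProperty are hypothetical names and not Derby code:

```java
/** Hypothetical helper for illustration; not Derby code. */
final class EmptyPropertyDemo {
    /**
     * Reads an integer system property, treating both "not set" (null)
     * and "set without a value" (empty string) as the default.
     */
    static int intProperty(String key, int defaultValue) {
        String raw = System.getProperty(key);
        if (raw == null || raw.trim().isEmpty()) {
            return defaultValue;
        }
        // A non-empty, non-numeric value still throws NumberFormatException.
        return Integer.parseInt(raw.trim());
    }
}
```

          Code that checks only for null would pass the empty string on to Integer.parseInt, which throws NumberFormatException for it.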

          Kathey Marsden added a comment -

          Attaching files per previous comment.

          Kathey Marsden added a comment - - edited

          I am looking at disabling this test and see that methods like isSunJVM(), isIBMJVM(), etc. are currently in BaseTestCase. I will be adding isAIXPlatform() and isJava5(), and with other issues for disabling tests there will probably be isMacPlatform() etc. Is BaseTestCase the right place for these kinds of methods?

          For the platforms, one option might be to have new class OsName.java that has String constants for all the os.name values and then just have an isPlatform(String) method in BaseTestCase. (sorry accidentally put this in DERBY-5096 at first.)
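
          The proposal above could look roughly like this. It is a sketch only: the constant names and the PlatformCheck holder class are assumptions (the comment places isPlatform(String) in BaseTestCase), and the set of os.name values is deliberately incomplete.

```java
/** Sketch of the proposed constants class; the names here are assumptions. */
final class OsName {
    static final String AIX = "AIX";
    static final String LINUX = "Linux";
    static final String MAC_OS_X = "Mac OS X";
    static final String WINDOWS_XP = "Windows XP";

    private OsName() {
        // Constants only; no instances.
    }
}

/** Hypothetical stand-in for the method proposed for BaseTestCase. */
final class PlatformCheck {
    /** Returns true if the os.name system property equals the given name. */
    static boolean isPlatform(String osName) {
        return osName.equals(System.getProperty("os.name"));
    }
}
```

          A fixture could then be skipped with a guard such as if (PlatformCheck.isPlatform(OsName.AIX)), instead of adding one isXxxPlatform() method per operating system.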

          Kathey Marsden added a comment -

          The attached patch derby-4319_disable_SetPortPriorty_diff.txt disables ttestSetPortPriority on AIX JDK 1.5. That seems to be the test that gets us into a state where we might hang, although the actual hang happens in the subsequent test, ttestDefaultProperties.

          I added the OsName class with the os.name values that I know. Still missing are Solaris and HP, and I am not sure whether Windows 7, 2008, etc. report different variations on Windows, but this is a start.

          I am looping derbynet on AIX now and doing a suites.All run on Windows.

          Kathey Marsden added a comment -

          I ended up having to disable both ttestSetPortPriority and ttestDefaultProperties. With this patch I got through 35 runs of derbynet on AIX. I'll check this in today.

          Myrna van Lunteren added a comment -

          I have now also seen the hang with ibm 1.6 SR9 FP1 (during the 10.8.1.1 platform tests).
          I guess we should stop these tests from running with ibm 1.6 as well.

          Knut Anders Hatlen added a comment - - edited

          I'm wondering if this could have the same root cause as DERBY-5192 (fixed after 10.8.1.1). That could prevent server shutdown from terminating the server, which would leave the server process running forever. Since NetworkServerTestSetup waits until the external process terminates, the test will just hang there. What Kathey observed when she pinged the hanging server on AIX is also consistent with what I saw when I pinged a server that hung because of DERBY-5192 (see comment https://issues.apache.org/jira/browse/DERBY-5192?focusedCommentId=13020264&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13020264 ).
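
To make the failure mode described above concrete: an unbounded wait on a spawned process blocks forever if the process never exits. The sketch below shows a bounded wait instead. It is an illustration only and not what NetworkServerTestSetup does; the class and method names are hypothetical, and Process.waitFor(long, TimeUnit) and destroyForcibly() require Java 8 or later, so they were not available on the JDK 1.5 and 1.6 runtimes discussed in this issue.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper: not part of the Derby test framework.
final class BoundedProcessWait {
    private BoundedProcessWait() {}

    /**
     * Waits up to timeoutMillis for the process to exit. If it is still
     * running after the timeout, it is killed and reaped.
     *
     * @return true if the process exited on its own, false if it was killed
     */
    static boolean waitOrKill(Process process, long timeoutMillis)
            throws InterruptedException {
        if (process.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
            return true; // exited within the timeout
        }
        process.destroyForcibly(); // still alive: terminate it
        process.waitFor();         // wait for the forced exit to complete
        return false;
    }
}
```

With a guard like this, a server that fails to shut down would surface as a test failure after the timeout rather than as a run that sits idle until someone kills the process by hand.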

          Myrna van Lunteren added a comment -

          Thanks for that comment; I saw the link, but hadn't noticed there was a fix. I was contemplating skipping the 10.8.1.2 RC run on this OS, but now I'll at least run it with ibm 1.6, and also uncomment the skipped tests and run with ibm 1.5.

          Myrna van Lunteren added a comment -

          I got the hang again with 10.7.1.1, then ran the same test with 10.8.1.2 (with a modified derbyTesting.jar), and indeed this now appears to be fixed. I re-enabled the tests with revision 1096875 on trunk and with revision 1096895 on 10.8.

          Kathey Marsden added a comment -

          I marked this derby_backport_reject_10_5, not because of any risk, but because this issue contains only test changes made during its diagnosis. The real fix, it seems, came under DERBY-5192, so there is no real value in backporting to 10.5.

          If someone ever wants these changes in older releases for some reason it should be fine to do so.

          Knut Anders Hatlen added a comment -

          [bulk update] Close all resolved issues that haven't been updated for more than one year.


            People

            • Assignee:
              Kathey Marsden
              Reporter:
              Myrna van Lunteren
