Details

    • Type: Test Test
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Currently, if tests hang in hudson it can go hung for days until we manually kill it.

      The problem is that when a hang happens its probably serious, what we want to do (I think), is:

      1. time out the build.
      2. ensure we have enough debugging information to hopefully fix any hang.

      So I think the ideal solution would be:

      1. add a sysprop "-D" that LuceneTestCase respects, it could default to no timeout at all (some value like zero).
      2. when a timeout is set, LuceneTestCase spawns an additional timer thread for the test class? method?
      3. if the timeout is exceeded, LuceneTestCase dumps all thread/stack information, random seed information to hopefully reproduce the hang, and fails the test.
      4. nightly builds would pass some reasonable -D for each test.

      separately, I think we should have an "ant-level" timeout for the whole build, in case it goes completely crazy (e.g. jvm completely hangs or something else), just as an additional safety.

      1. LUCENE-3429.patch
        8 kB
        Dawid Weiss
      2. LUCENE-3429.patch
        1 kB
        Robert Muir

        Issue Links

          Activity

          Robert Muir created issue -
          Hide
          Robert Muir added a comment -

          I'm gonna play with the ant junit task timeout first, just to see if we can do anything with it as a quick hack.

          I suspect the problem will be that we won't get enough debugging information via this mechanism (random seed, stacktraces).

          Show
          Robert Muir added a comment - I'm gonna play with the ant junit task timeout first, just to see if we can do anything with it as a quick hack. I suspect the problem will be that we won't get enough debugging information via this mechanism (random seed, stacktraces).
          Hide
          Robert Muir added a comment -

          here is a hack patch that sets a timeout of 1 hour to any test batch (e.g. test-core) by default, unless you are running Test2BTerms (10 hours).

          i tested this, the issue is you get no debugging information at all... but its at least a small start.

          Show
          Robert Muir added a comment - here is a hack patch that sets a timeout of 1 hour to any test batch (e.g. test-core) by default, unless you are running Test2BTerms (10 hours). i tested this, the issue is you get no debugging information at all... but its at least a small start.
          Robert Muir made changes -
          Field Original Value New Value
          Attachment LUCENE-3429.patch [ 12493960 ]
          Hide
          Michael McCandless added a comment -

          We could run a standalone tool that does a kill -QUIT if any java process is taking > X minutes?

          Show
          Michael McCandless added a comment - We could run a standalone tool that does a kill -QUIT if any java process is taking > X minutes?
          Hide
          Robert Muir added a comment -

          Mike, right but even that solution wouldn't be that great: it wouldn't give us random seed

          Dawid pointed me to some code of his, I think he is working on a prototype for us to try to integrate:

          https://github.com/dweiss/timeoutrule/tree/master/src/test/java/com/carrotsearch

          Show
          Robert Muir added a comment - Mike, right but even that solution wouldn't be that great: it wouldn't give us random seed Dawid pointed me to some code of his, I think he is working on a prototype for us to try to integrate: https://github.com/dweiss/timeoutrule/tree/master/src/test/java/com/carrotsearch
          Hide
          Hoss Man added a comment -

          separately, I think we should have an "ant-level" timeout for the whole build, in case it goes completely crazy (e.g. jvm completely hangs or something else), just as an additional safety.

          jenkins's has a build option to handle this part (no personal experience with it though)

          Dawid pointed me to some code of his, ...

          A per test annotation definitely seems like the killer solution.

          Show
          Hoss Man added a comment - separately, I think we should have an "ant-level" timeout for the whole build, in case it goes completely crazy (e.g. jvm completely hangs or something else), just as an additional safety. jenkins's has a build option to handle this part (no personal experience with it though) Dawid pointed me to some code of his, ... A per test annotation definitely seems like the killer solution.
          Hide
          Dawid Weiss added a comment -

          I wrote that snippet yesterday as a proof of concept - it is simple and I think does the job (you can control timeouts of all tests of a class, a single test's timeout time or, in case of Lucene, global timeouts on all tests simply by putting the rule in the superclass of all tests).

          In fact, JUnit4 already has two built-in options for doing timeouts: a @Timeout rule much like mine (but not logging object's internal fields) and @Test(timeout=...). My implementation simple expands on this by adding a "live" dump of objects being tested (important in case of live fields) – sending a signal (or ctrl-break combination on windows) will dump the spinning test's fields to the console.

          I can prepare a patch, but it's actually trivial to just take the rule's code from github and paste in LuceneTestCase. Should work out of the box, perpahs with changes related to the fact that I used commons-lang for dumping the target object and we may simply restrict it to dumping the current seed.

          Show
          Dawid Weiss added a comment - I wrote that snippet yesterday as a proof of concept - it is simple and I think does the job (you can control timeouts of all tests of a class, a single test's timeout time or, in case of Lucene, global timeouts on all tests simply by putting the rule in the superclass of all tests). In fact, JUnit4 already has two built-in options for doing timeouts: a @Timeout rule much like mine (but not logging object's internal fields) and @Test(timeout=...). My implementation simple expands on this by adding a "live" dump of objects being tested (important in case of live fields) – sending a signal (or ctrl-break combination on windows) will dump the spinning test's fields to the console. I can prepare a patch, but it's actually trivial to just take the rule's code from github and paste in LuceneTestCase. Should work out of the box, perpahs with changes related to the fact that I used commons-lang for dumping the target object and we may simply restrict it to dumping the current seed.
          Hide
          Dawid Weiss added a comment -

          Suggested patch introducing method-level timeouts and seed/details reporting.

          Show
          Dawid Weiss added a comment - Suggested patch introducing method-level timeouts and seed/details reporting.
          Dawid Weiss made changes -
          Attachment LUCENE-3429.patch [ 12494104 ]
          Dawid Weiss made changes -
          Attachment LUCENE-3429.patch [ 12494104 ]
          Hide
          Dawid Weiss added a comment -

          Updated slightly to avoid calling reportAdditionalFailureInfo before an actual timeout is hit.

          Show
          Dawid Weiss added a comment - Updated slightly to avoid calling reportAdditionalFailureInfo before an actual timeout is hit.
          Dawid Weiss made changes -
          Attachment LUCENE-3429.patch [ 12494115 ]
          Hide
          Dawid Weiss added a comment -

          Errr... I get a repetitive vm crash after applying the above patch.

              [junit] #
              [junit] # A fatal error has been detected by the Java Runtime Environment:
              [junit] #
              [junit] #  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x000000006dae29a5, pid=4040, tid=5104
              [junit] #
              [junit] # JRE version: 6.0_26-b03
              [junit] # Java VM: Java HotSpot(TM) 64-Bit Server VM (20.1-b02 mixed mode windows-amd64 compressed oops)
              [junit] # Problematic frame:
              [junit] # V  [jvm.dll+0x2529a5]
              [junit] #
              [junit] # An error report file with more information is saved as:
              [junit] # d:\work\apache.org\lucene.git\lucene\build\test\7\hs_err_pid4040.log
              [junit] #
              [junit] # If you would like to submit a bug report, please visit:
              [junit] #   http://java.sun.com/webapps/bugreport/crash.jsp
              [junit] #
              [junit] Testsuite: org.Batch-With-Multiple-Tests
              [junit] Testcase: org.Batch-With-Multiple-Tests:testFlushExceptions:        Caused an ERROR
              [junit] Forked Java VM exited abnormally. Please note the time in the report does not reflect the time until the VM exit.
              [junit] junit.framework.AssertionFailedError: Forked Java VM exited abnormally. Please note the time in the report does not reflect the time until the VM ex
          it.
              [junit]     at java.lang.Thread.run(Thread.java:662)
              [junit]
          

          Is this something new or known? I'm on win7, 64-bit - can anybody check on other machines?

          Show
          Dawid Weiss added a comment - Errr... I get a repetitive vm crash after applying the above patch. [junit] # [junit] # A fatal error has been detected by the Java Runtime Environment: [junit] # [junit] # EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x000000006dae29a5, pid=4040, tid=5104 [junit] # [junit] # JRE version: 6.0_26-b03 [junit] # Java VM: Java HotSpot(TM) 64-Bit Server VM (20.1-b02 mixed mode windows-amd64 compressed oops) [junit] # Problematic frame: [junit] # V [jvm.dll+0x2529a5] [junit] # [junit] # An error report file with more information is saved as: [junit] # d:\work\apache.org\lucene.git\lucene\build\test\7\hs_err_pid4040.log [junit] # [junit] # If you would like to submit a bug report, please visit: [junit] # http://java.sun.com/webapps/bugreport/crash.jsp [junit] # [junit] Testsuite: org.Batch-With-Multiple-Tests [junit] Testcase: org.Batch-With-Multiple-Tests:testFlushExceptions: Caused an ERROR [junit] Forked Java VM exited abnormally. Please note the time in the report does not reflect the time until the VM exit. [junit] junit.framework.AssertionFailedError: Forked Java VM exited abnormally. Please note the time in the report does not reflect the time until the VM ex it. [junit] at java.lang.Thread.run(Thread.java:662) [junit] Is this something new or known? I'm on win7, 64-bit - can anybody check on other machines?
          Hide
          Robert Muir added a comment -

          Does this jre crash occur when you interrupt() a thread?

          Show
          Robert Muir added a comment - Does this jre crash occur when you interrupt() a thread?
          Hide
          Dawid Weiss added a comment -

          I didn't check and this should be a no-issue, even when you call stop() because stop() is supposed to throw a ThreadDeathError at the current execution point. interrupt() doesn't do anything other than setting a thread flag (it does interrupt i/o and monitors, but not busy loops). I don't think it's what we want.

          I didn't check on linux/ macos yet, didn't have time.

          Show
          Dawid Weiss added a comment - I didn't check and this should be a no-issue, even when you call stop() because stop() is supposed to throw a ThreadDeathError at the current execution point. interrupt() doesn't do anything other than setting a thread flag (it does interrupt i/o and monitors, but not busy loops). I don't think it's what we want. I didn't check on linux/ macos yet, didn't have time.
          Hide
          Robert Muir added a comment -

          I only mention it because interrupting threads in our tests caused jre crashes before, LuceneTestCase:784

          Show
          Robert Muir added a comment - I only mention it because interrupting threads in our tests caused jre crashes before, LuceneTestCase:784
          Hide
          Dawid Weiss added a comment -

          Right.... Interesting. Ok, I'll dig later on.

          Show
          Dawid Weiss added a comment - Right.... Interesting. Ok, I'll dig later on.
          Hide
          Dawid Weiss added a comment -

          The tests with this patch hang on my mac and on linux box. On windows they crash the jvm... oh boy, what a joy.

          Show
          Dawid Weiss added a comment - The tests with this patch hang on my mac and on linux box. On windows they crash the jvm... oh boy, what a joy.
          Hide
          Simon Willnauer added a comment -

          david, I looked at your patch briefly and I think what you are trying to do is unfortunately not possible with java and its thread implementation. And actually I think it should not be possible too. Calling Thread.stop() is deprecated for a very good reason since it can put your Thread into an undefined state. I think what we should do instead is get all threads stack traces (Thread#getAllStackTraces()) and then interrupt the threads we obtained. if a thread doesn't react to interrupt() it won't react to stop() either afaik.

          Show
          Simon Willnauer added a comment - david, I looked at your patch briefly and I think what you are trying to do is unfortunately not possible with java and its thread implementation. And actually I think it should not be possible too. Calling Thread.stop() is deprecated for a very good reason since it can put your Thread into an undefined state. I think what we should do instead is get all threads stack traces (Thread#getAllStackTraces()) and then interrupt the threads we obtained. if a thread doesn't react to interrupt() it won't react to stop() either afaik.
          Hide
          Dawid Weiss added a comment -

          Not really. Thread.stop is deprecated and I know the reasons it is deprecated, but this doesn't mean it doesn't work. interrupt() is not a solution if you have busy code loops that never do any I/O or waiting/sleeping – try interrupting while(true); and you'll see what I mean.

          The code I posted is correct. Thread.stop() is not even called because no methods do timeout at the moment. The problem with hangs and jre crashes is something else, but I didn't have the time to peek at it yet.

          Show
          Dawid Weiss added a comment - Not really. Thread.stop is deprecated and I know the reasons it is deprecated, but this doesn't mean it doesn't work. interrupt() is not a solution if you have busy code loops that never do any I/O or waiting/sleeping – try interrupting while(true); and you'll see what I mean. The code I posted is correct. Thread.stop() is not even called because no methods do timeout at the moment. The problem with hangs and jre crashes is something else, but I didn't have the time to peek at it yet.
          Hide
          Dawid Weiss added a comment -

          Ehm. So... I figured out at least the reason for the fact the tests hang on mac/linux if the patch is applied. The reason is this in TestConcurrentMergeScheduler):

              @Override
              public void eval(MockDirectoryWrapper dir)  throws IOException {
                if (doFail && (Thread.currentThread().getName().equals("main") 
                    || Thread.currentThread().getName().equals("Main Thread"))) {
          

          because we're spawning threads named differently this test never finishes (because this condition never occurs). Is this supposed to check if the code runs a test case? If so then I suggest a different way of checking this: create a field on LuceneTestCase with

          volatile Thread current;
          

          this would be set by the runner (or a rule, whatever) and then test cases could verify they are running inside a test case by comparing Thread.currentThread() == current.

          Sounds good?

          Show
          Dawid Weiss added a comment - Ehm. So... I figured out at least the reason for the fact the tests hang on mac/linux if the patch is applied. The reason is this in TestConcurrentMergeScheduler): @Override public void eval(MockDirectoryWrapper dir) throws IOException { if (doFail && ( Thread .currentThread().getName().equals( "main" ) || Thread .currentThread().getName().equals( "Main Thread " ))) { because we're spawning threads named differently this test never finishes (because this condition never occurs). Is this supposed to check if the code runs a test case? If so then I suggest a different way of checking this: create a field on LuceneTestCase with volatile Thread current; this would be set by the runner (or a rule, whatever) and then test cases could verify they are running inside a test case by comparing Thread.currentThread() == current. Sounds good?
          Hide
          Mark Miller added a comment -

          interrupt() is not a solution if you have busy code loops that never do any I/O or waiting/sleeping

          Indeed - the correct statement is that stop would not stop a thread that is waiting if interrupt would also not stop it. But stop is certainly a step beyond interrupt in ending a threads life.

          Show
          Mark Miller added a comment - interrupt() is not a solution if you have busy code loops that never do any I/O or waiting/sleeping Indeed - the correct statement is that stop would not stop a thread that is waiting if interrupt would also not stop it. But stop is certainly a step beyond interrupt in ending a threads life.
          Hide
          Dawid Weiss added a comment -

          the correct statement is that stop would not stop a thread that is waiting if interrupt would also not stop it

          Ehm, too many negations for me, but I think you meant the other way around? Anyway, there's really little to it: stop() and interrupt() both act similar: they attempt to break the thread's execution by throwing an exception inside the thread's current call stack. The difference is that interrupt() sets a flag on the thread which is checked by wait/sleep method and I/O and then thrown as a checked exception and stop() tries to throw an unchecked exception as early as possible and theoretically can happen at any given statement.

          In a piece of software that cleans up resources using finally() and doesn't capture-and-ignore of Throwable/Error exceptions this shouldn't really matter that much and be safe.

          Simon was worried about calling stop() and possibly leaving junk on disk or doing weird stuff. True, this can happen, but in the end it's what will happen anyway if a thread is busy-looped infinitely or locked: either we will try to kill it or the jvm will at the end of its execution.

          I will modify the code to use a more graceful cascade of: interrupt() - wait a bit - then try to kill the thread because I still think it has advantages over just leaving the problematic thread running in the background. These disadvantages are:

          • the vm will never exit from tests if the threads are non-daemon threads,
          • background threads may interfere with other threads and provide noise that will not be reproducible.

          These are my motivating factors for using stop() as a last resort option for threads that did go into an endless loop (or exceeded a largeish timeout time). Simon, I know you have a gut feeling that calling stop() is wrong, but you need to convince me with arguments other than just your gut feeling

          Show
          Dawid Weiss added a comment - the correct statement is that stop would not stop a thread that is waiting if interrupt would also not stop it Ehm, too many negations for me, but I think you meant the other way around? Anyway, there's really little to it: stop() and interrupt() both act similar: they attempt to break the thread's execution by throwing an exception inside the thread's current call stack. The difference is that interrupt() sets a flag on the thread which is checked by wait/sleep method and I/O and then thrown as a checked exception and stop() tries to throw an unchecked exception as early as possible and theoretically can happen at any given statement. In a piece of software that cleans up resources using finally() and doesn't capture-and-ignore of Throwable/Error exceptions this shouldn't really matter that much and be safe. Simon was worried about calling stop() and possibly leaving junk on disk or doing weird stuff. True, this can happen, but in the end it's what will happen anyway if a thread is busy-looped infinitely or locked: either we will try to kill it or the jvm will at the end of its execution. I will modify the code to use a more graceful cascade of: interrupt() - wait a bit - then try to kill the thread because I still think it has advantages over just leaving the problematic thread running in the background. These disadvantages are: the vm will never exit from tests if the threads are non-daemon threads, background threads may interfere with other threads and provide noise that will not be reproducible. These are my motivating factors for using stop() as a last resort option for threads that did go into an endless loop (or exceeded a largeish timeout time). Simon, I know you have a gut feeling that calling stop() is wrong, but you need to convince me with arguments other than just your gut feeling
          Hide
          Dawid Weiss added a comment -

          Mike, Robert, Mark – can you confirm my understanding of this snippet:

          if (doFail && (Thread.currentThread().getName().equals("main") 
                    || Thread.currentThread().getName().equals("Main Thread"))) {
          

          what was it for (thread name checking)? Can this method be called from threads other than the main junit thread (and in such case shouldn't go into the if block?).

          Show
          Dawid Weiss added a comment - Mike, Robert, Mark – can you confirm my understanding of this snippet: if (doFail && ( Thread .currentThread().getName().equals( "main" ) || Thread .currentThread().getName().equals( "Main Thread " ))) { what was it for (thread name checking)? Can this method be called from threads other than the main junit thread (and in such case shouldn't go into the if block?).
          Hide
          Mark Miller added a comment -

          but I think you meant the other way around?

          I liked the tough negation wording! But I'm sticking to the order (as long as I'm peeling the negations off right). I can put it in the original terms though: if a waiting thread doesn't react to interrupt() it won't react to stop() either. But without the waiting, the statement is wrong.

          Show
          Mark Miller added a comment - but I think you meant the other way around? I liked the tough negation wording! But I'm sticking to the order (as long as I'm peeling the negations off right). I can put it in the original terms though: if a waiting thread doesn't react to interrupt() it won't react to stop() either. But without the waiting , the statement is wrong.
          Hide
          Robert Muir added a comment -

          what was it for (thread name checking)?

          Yes, its "main" on Sun/IBM/... and "Main Thread" on Jrockit

          Show
          Robert Muir added a comment - what was it for (thread name checking)? Yes, its "main" on Sun/IBM/... and "Main Thread" on Jrockit
          Hide
          Dawid Weiss added a comment -

          if a waiting thread doesn't react to interrupt() it won't react to stop() either. But without the waiting, the statement is wrong

          Yes, this is probably true. If it's waiting (on a monitor or i/o) and doesn't react to interrupt, then it's in a deep hole somewhere where nothing's going to help it

          Yes, its "main" on Sun/IBM/... and "Main Thread" on Jrockit

          Ok, so can I change it the way I suggested (i.e. have a "test" thread variable on the test superclass and compare to it instead)? You didn't explain the reason this code needs this comparison at all.

          Show
          Dawid Weiss added a comment - if a waiting thread doesn't react to interrupt() it won't react to stop() either. But without the waiting, the statement is wrong Yes, this is probably true. If it's waiting (on a monitor or i/o) and doesn't react to interrupt, then it's in a deep hole somewhere where nothing's going to help it Yes, its "main" on Sun/IBM/... and "Main Thread" on Jrockit Ok, so can I change it the way I suggested (i.e. have a "test" thread variable on the test superclass and compare to it instead)? You didn't explain the reason this code needs this comparison at all.
          Dawid Weiss made changes -
          Link This issue is related to LUCENE-3489 [ LUCENE-3489 ]
          Hide
          Dawid Weiss added a comment -

          Related discussion about test infrastructure in general.

          Show
          Dawid Weiss added a comment - Related discussion about test infrastructure in general.
          Simon Willnauer made changes -
          Fix Version/s 3.6 [ 12319070 ]
          Fix Version/s 3.5 [ 12317877 ]
          Robert Muir made changes -
          Assignee Dawid Weiss [ dweiss ]
          Fix Version/s 3.6 [ 12319070 ]
          Dawid Weiss made changes -
          Link This issue is part of LUCENE-3808 [ LUCENE-3808 ]
          Hide
          Dawid Weiss added a comment -

          This would be solved by LUCENE-3808 if it ever makes it in.

          Show
          Dawid Weiss added a comment - This would be solved by LUCENE-3808 if it ever makes it in.
          Dawid Weiss made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Uwe Schindler made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Dawid Weiss
              Reporter:
              Robert Muir
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development