Harmony
  1. Harmony
  2. HARMONY-1789

[DRLVM] Race condition between Thread.interrupt and Object.wait, Thread.sleep or join

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: DRLVM
    • Labels:
      None
    • Estimated Complexity:
      Moderate

      Description

      According to the specification wait/join or sleep methods should check if thread was interrupted before entering wait. DRLVM hythread implmentation do checks interrupted status(see hythread condition implmentation), however thread may be interrupted just after the status was checked and before entering condition wait. Thus it's required to ensure that waiting state will be reached having interrupted status unchanged.

      1. Test.java
        1 kB
        Nikolay Kuznetsov
      2. H-1789.patch
        11 kB
        Nikolay Kuznetsov
      3. HARMONY1789.java
        1 kB
        Salikh Zakirov
      4. H-1789-updated.patch
        10 kB
        Salikh Zakirov

        Issue Links

          Activity

          Hide
          Alexey Varlamov added a comment -

          Applied H-1789-update.patch + HARMONY-2217 at r476148.
          I suggest to close this issue, remained subtask can be resolved on their own.

          Show
          Alexey Varlamov added a comment - Applied H-1789-update.patch + HARMONY-2217 at r476148. I suggest to close this issue, remained subtask can be resolved on their own.
          Hide
          Alexey Varlamov added a comment -

          Well, it's all the same for multicore Win2003 server:

          • HARMONY1789 hangs without HARMONY-2217;
          • the test itself shows bad statistics for all RI, DRLVM and DRLVM+all patches;
          • ThreadTest still hangs.

          So, at least no regressions everywhere and slight improvement for single CPU systems on HARMONY1789 test.

          Show
          Alexey Varlamov added a comment - Well, it's all the same for multicore Win2003 server: HARMONY1789 hangs without HARMONY-2217 ; the test itself shows bad statistics for all RI, DRLVM and DRLVM+all patches; ThreadTest still hangs. So, at least no regressions everywhere and slight improvement for single CPU systems on HARMONY1789 test.
          Hide
          Alexey Varlamov added a comment -

          Here is what I found on multicore Linux machine (SUSE9 2xXeon):
          1) The HARMONY1789 hangs with pure H-1789-update.patch, the same as it does with original Nikolay's patch. Good news is that HARMONY-2217 patches (#1 actually) fixes this hang.
          2) All the patches do not improve run statistics on the mntioned machine . In fact BEA behaves equally badly, typical picture:
          1000 interrupts generated
          225 interrupts received, 84 polled, 141 exceptions
          FAILED
          Clearly the test is inherently unstable and is not really suitable for multicore systems... I've been lucky to get similar picture even on my Win laptop on BEA, but only once; next ~500 runs passed.
          3) Nikolay claimed [1] that this patch should solve hanging ThreadTest of classlib - but it does not. However this may be unrelated problem, I do not consider it as blocker for this issue.
          [1] http://issues.apache.org/jira/browse/HARMONY-1974#action_12449458

          Generally I agree that this series of patches improves code a lot; will test it a bit more and commit if nothing alarming happened.

          Show
          Alexey Varlamov added a comment - Here is what I found on multicore Linux machine (SUSE9 2xXeon): 1) The HARMONY1789 hangs with pure H-1789-update.patch, the same as it does with original Nikolay's patch. Good news is that HARMONY-2217 patches (#1 actually) fixes this hang. 2) All the patches do not improve run statistics on the mntioned machine . In fact BEA behaves equally badly, typical picture: 1000 interrupts generated 225 interrupts received, 84 polled, 141 exceptions FAILED Clearly the test is inherently unstable and is not really suitable for multicore systems... I've been lucky to get similar picture even on my Win laptop on BEA, but only once; next ~500 runs passed. 3) Nikolay claimed [1] that this patch should solve hanging ThreadTest of classlib - but it does not. However this may be unrelated problem, I do not consider it as blocker for this issue. [1] http://issues.apache.org/jira/browse/HARMONY-1974#action_12449458 Generally I agree that this series of patches improves code a lot; will test it a bit more and commit if nothing alarming happened.
          Hide
          Salikh Zakirov added a comment -

          DRLVM with applied H-1789-update.patch + HARMONY-2217 + HARMONY+2218 + HARMONY-2219 passed (almost all) tests on Linux/ia32 and Windows/ia32 (- gc.LOS -ThreadTest.testJoinlongint – see HARMONY-2204)

          Show
          Salikh Zakirov added a comment - DRLVM with applied H-1789-update.patch + HARMONY-2217 + HARMONY+2218 + HARMONY-2219 passed (almost all) tests on Linux/ia32 and Windows/ia32 (- gc.LOS -ThreadTest.testJoinlongint – see HARMONY-2204 )
          Hide
          Salikh Zakirov added a comment -

          HARMONY-2217 contains improvements on top of H-1789-updated.patch, particularly the point (2) above addressed.

          HARMONY-2218 and HARMONY-2219 contain splitted out parts of original H-1789.patch, that are not related directly to this issue.

          Show
          Salikh Zakirov added a comment - HARMONY-2217 contains improvements on top of H-1789-updated.patch, particularly the point (2) above addressed. HARMONY-2218 and HARMONY-2219 contain splitted out parts of original H-1789.patch, that are not related directly to this issue.
          Hide
          Salikh Zakirov added a comment -

          H-1789-updated.patch is an updated patch that fixes review items 1,3,4,5.
          The point (2) and unrelated changes in Thread.java deserve separate patches.

          This patch passed tests (except gc.LOS) on WinXP/ia32/msvc and Linux/ia32 SuSE9/gcc 3.3.3,
          and compile-tested on Linux/x86_64

          Show
          Salikh Zakirov added a comment - H-1789-updated.patch is an updated patch that fixes review items 1,3,4,5. The point (2) and unrelated changes in Thread.java deserve separate patches. This patch passed tests (except gc.LOS) on WinXP/ia32/msvc and Linux/ia32 SuSE9/gcc 3.3.3, and compile-tested on Linux/x86_64
          Hide
          Alexey Varlamov added a comment -

          Salikh,
          Thank you so much for thorough evaluation. And, splitting patch to related parts would really help, I was confused to see evidently unrelated changes in a bulk.

          PS. I felt a bit uneasy being obliged to express negative response, but we just cannot afford uncotrolled fixes in TM any longer.
          Now I'm happy to see we are on the right track.

          Show
          Alexey Varlamov added a comment - Salikh, Thank you so much for thorough evaluation. And, splitting patch to related parts would really help, I was confused to see evidently unrelated changes in a bulk. PS. I felt a bit uneasy being obliged to express negative response, but we just cannot afford uncotrolled fixes in TM any longer. Now I'm happy to see we are on the right track.
          Hide
          Salikh Zakirov added a comment -

          Some comments on the H-1789.patch. Nikolay, could you please review?
          If you have no objections, I would implement commens and attach updated patch.

          1) fflush(NULL); seems to be unrelated and, in fact, it mustn't be there (it can change the I/O behaviour from expected)

          2) What is the reason for introducing interrupter_thread_function() running in a separate thread?
          I guess the following substitution will do just as well (and without new threads):

          • if (thread->monitor && (hythread_monitor_try_enter(thread->monitor) == TM_ERROR_NONE)) { - hythread_monitor_notify_all(thread->monitor); - hythread_monitor_exit(thread->monitor); - }

            else

            { - status = hythread_create(&thread->interrupter, 0, 0, 0, interrupter_thread_function, (void *)thread); - assert (status == TM_ERROR_NONE); - }

            + if (thread->monitor)

            { hycond_notify(thread->monitor->condition); }

          and instead of doing
          wait_count = 0
          in notify_all() we could decrease wait_count in the waiting thread just after condvar_wait_impl() returned.

          3) the two changes to Thread.java looks to me as unrelated to both this issue and each other. (and therefore should not be committed together with this patch)
          By the way, why we have a waiting loop in Thread.start() at all? Spec does not mandate it. – I think I would file a separate JIRA to remove this loop.

          4) sleep_event and park_event are not used anymore and should be cleaned up

          5) commented out code should be removed, and some style corrected

          Show
          Salikh Zakirov added a comment - Some comments on the H-1789.patch. Nikolay, could you please review? If you have no objections, I would implement commens and attach updated patch. 1) fflush(NULL); seems to be unrelated and, in fact, it mustn't be there (it can change the I/O behaviour from expected) 2) What is the reason for introducing interrupter_thread_function() running in a separate thread? I guess the following substitution will do just as well (and without new threads): if (thread->monitor && (hythread_monitor_try_enter(thread->monitor) == TM_ERROR_NONE)) { - hythread_monitor_notify_all(thread->monitor); - hythread_monitor_exit(thread->monitor); - } else { - status = hythread_create(&thread->interrupter, 0, 0, 0, interrupter_thread_function, (void *)thread); - assert (status == TM_ERROR_NONE); - } + if (thread->monitor) { hycond_notify(thread->monitor->condition); } and instead of doing wait_count = 0 in notify_all() we could decrease wait_count in the waiting thread just after condvar_wait_impl() returned. 3) the two changes to Thread.java looks to me as unrelated to both this issue and each other. (and therefore should not be committed together with this patch) By the way, why we have a waiting loop in Thread.start() at all? Spec does not mandate it. – I think I would file a separate JIRA to remove this loop. 4) sleep_event and park_event are not used anymore and should be cleaned up 5) commented out code should be removed, and some style corrected
          Hide
          Gregory Shimansky added a comment -

          If test is not reliable to show the bug, maybe just change the main function to run the current main function many times?

          Show
          Gregory Shimansky added a comment - If test is not reliable to show the bug, maybe just change the main function to run the current main function many times?
          Hide
          Salikh Zakirov added a comment -

          HARMONY1789.java provides a sort of a reproducer for this issue.
          Formally, specification does not provide any guarantee that interruptions will be caught correctly,
          however, the test happen to pass several times in a row on both Bea and Sun JVMs.
          Expected output:
          100000 interrupts generated
          100000 interrupts received, 1 polled, 99999 exceptions
          PASSED
          (sometimes it is "0 polled, 100000 exceptions", but the sum is still the same).

          The output on current DRLVM (from trunk) may be:
          100000 interrupts generated
          99999 interrupts received, 1 polled, 99998 exceptions
          FAILED
          Running the test several times in a row, I've got following pass rates:
          DRLVM from SVN: failed 9 of 17
          DRLVM + H-1789.patch: failed 1 of 17.
          Bea: passed 17 times in a row

          Thus, this test, while not fully correct, demonstrates how this issue (as described in Description above) can affect DRLVM behaviour.

          Show
          Salikh Zakirov added a comment - HARMONY1789.java provides a sort of a reproducer for this issue. Formally, specification does not provide any guarantee that interruptions will be caught correctly, however, the test happen to pass several times in a row on both Bea and Sun JVMs. Expected output: 100000 interrupts generated 100000 interrupts received, 1 polled, 99999 exceptions PASSED (sometimes it is "0 polled, 100000 exceptions", but the sum is still the same). The output on current DRLVM (from trunk) may be: 100000 interrupts generated 99999 interrupts received, 1 polled, 99998 exceptions FAILED Running the test several times in a row, I've got following pass rates: DRLVM from SVN: failed 9 of 17 DRLVM + H-1789.patch: failed 1 of 17. Bea: passed 17 times in a row Thus, this test, while not fully correct, demonstrates how this issue (as described in Description above) can affect DRLVM behaviour.
          Hide
          Salikh Zakirov added a comment -

          1) The following change (on top of H-1789.patch) fixes unit test hang:
          — vm/thread/src/thread_native_park.c
          +++ vm/thread/src/thread_native_park.c
          @@ -59,7 +59,9 @@ IDATA VMCALL hythread_park(I_64 millis,
          return (t->state & TM_THREAD_STATE_INTERRUPTED) ? TM_ERROR_INTERRUPT : TM_ERROR_NONE;
          }

          + t->state |= TM_THREAD_STATE_PARKED;
          status = hycond_wait_interruptable(t->condition, t->mutex, millis, nanos);
          + t->state &= ~TM_THREAD_STATE_PARKED;

          //status = hysem_wait_interruptable(t->park_event, millis, nanos);
          //the status should be restored for j.u.c.LockSupport

          2) The patch is fixing the stated problem directly, however, the attached test case is not a reproducer for the problem. That's why it passes even without the patch.

          It looks like H-1789.patch has several unrelated changes. I am going now to split it into series of independent changes. I'll include the above fix as well.
          Generally, the approach looks correct and makes sleep(), park() and unpark() code easier to understand and reason about.

          Show
          Salikh Zakirov added a comment - 1) The following change (on top of H-1789.patch) fixes unit test hang: — vm/thread/src/thread_native_park.c +++ vm/thread/src/thread_native_park.c @@ -59,7 +59,9 @@ IDATA VMCALL hythread_park(I_64 millis, return (t->state & TM_THREAD_STATE_INTERRUPTED) ? TM_ERROR_INTERRUPT : TM_ERROR_NONE; } + t->state |= TM_THREAD_STATE_PARKED; status = hycond_wait_interruptable(t->condition, t->mutex, millis, nanos); + t->state &= ~TM_THREAD_STATE_PARKED; //status = hysem_wait_interruptable(t->park_event, millis, nanos); //the status should be restored for j.u.c.LockSupport 2) The patch is fixing the stated problem directly, however, the attached test case is not a reproducer for the problem. That's why it passes even without the patch. It looks like H-1789.patch has several unrelated changes. I am going now to split it into series of independent changes. I'll include the above fix as well. Generally, the approach looks correct and makes sleep(), park() and unpark() code easier to understand and reason about.
          Hide
          Alexey Varlamov added a comment -

          Nikolay,
          Sorry if the following sounds tough, but:
          1) It looks really strange - how much consideration has been put in the patch if you did not even tried to run unit tests?
          2) It is hard to follow, how the H-1789.patch relates to initial issue report. Attached Test.java works just fine w/o the patch.

          Could you please address the above notes?

          Show
          Alexey Varlamov added a comment - Nikolay, Sorry if the following sounds tough, but: 1) It looks really strange - how much consideration has been put in the patch if you did not even tried to run unit tests? 2) It is hard to follow, how the H-1789.patch relates to initial issue report. Attached Test.java works just fine w/o the patch. Could you please address the above notes?
          Hide
          Elena Semukhina added a comment -

          Nikolay,
          I tried the patch. It solves the classlib ThreadTest hanging but C unit test test_java_park hangs now:

          [echo] ## Executing C unit test: test_java_park
          [hangs...]

          Show
          Elena Semukhina added a comment - Nikolay, I tried the patch. It solves the classlib ThreadTest hanging but C unit test test_java_park hangs now: [echo] ## Executing C unit test: test_java_park [hangs...]
          Hide
          Nikolay Kuznetsov added a comment -

          Attached patch contains fix for this problem. It includes alternative implementation of thread's sleep,park and wait methods. In all above cases threads state is serialized using common lock in case of sleep and park and monitor is used in wait case.

          Also this patch contains isAlive check to the j.l.Thread.stop(Throwable) method.

          Show
          Nikolay Kuznetsov added a comment - Attached patch contains fix for this problem. It includes alternative implementation of thread's sleep,park and wait methods. In all above cases threads state is serialized using common lock in case of sleep and park and monitor is used in wait case. Also this patch contains isAlive check to the j.l.Thread.stop(Throwable) method.
          Hide
          Elena Semukhina added a comment -

          I still observe ThreadTest.testInterrupt_Sleeping() intermittent failures when running drlvm kernel tests even after H-1592 and H-1823 commits. But the attached test passes for me when this.wait () replaced with Thread.sleep() in the ThreadWaiting.run() method.

          Show
          Elena Semukhina added a comment - I still observe ThreadTest.testInterrupt_Sleeping() intermittent failures when running drlvm kernel tests even after H-1592 and H-1823 commits. But the attached test passes for me when this.wait () replaced with Thread.sleep() in the ThreadWaiting.run() method.
          Hide
          Geir Magnusson Jr added a comment -

          Should we back something out, or just try and find the problem?

          Show
          Geir Magnusson Jr added a comment - Should we back something out, or just try and find the problem?
          Hide
          Elena Semukhina added a comment -

          As it appeared, some other issues caused pass rate degradation (H-1592, H-1816, H-1823).
          This issue impacts on drlvm kernel unit tests and does not impact on classlib tests.

          Show
          Elena Semukhina added a comment - As it appeared, some other issues caused pass rate degradation (H-1592, H-1816, H-1823). This issue impacts on drlvm kernel unit tests and does not impact on classlib tests.
          Hide
          Elena Semukhina added a comment -

          [drlvm][unit] Blocked http://wiki.apache.org/harmony/Unit_Tests_Pass_on_DRLVM

          This bug drastically increased the number of classlib tests failures. Today we have 66 failures in luni module while yesterday there were only 8.

          To reproduce classlib tests failures please run
          ant -Dtest.jre.home=%DRLVM% -Dbuild.module=luni test

          Show
          Elena Semukhina added a comment - [drlvm] [unit] Blocked http://wiki.apache.org/harmony/Unit_Tests_Pass_on_DRLVM This bug drastically increased the number of classlib tests failures. Today we have 66 failures in luni module while yesterday there were only 8. To reproduce classlib tests failures please run ant -Dtest.jre.home=%DRLVM% -Dbuild.module=luni test
          Hide
          Nikolay Kuznetsov added a comment -

          The attached tests case is an except from kernel.ThreadTest, which fails conceivably due to this issue, however this case requires additional investigatiom.

          Show
          Nikolay Kuznetsov added a comment - The attached tests case is an except from kernel.ThreadTest, which fails conceivably due to this issue, however this case requires additional investigatiom.

            People

            • Assignee:
              Alexey Varlamov
              Reporter:
              Nikolay Kuznetsov
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development