Lucene - Core
  1. Lucene - Core
  2. LUCENE-2730

intermittent deadlock in TestAtomicUpdate,TestIndexWriterExceptions

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 2.9.3, 3.0.2
    • Fix Version/s: 2.9.4, 3.0.3, 3.1, 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      While backporting issues for 2.9.x/3.0.x release I hit deadlocks in these two tests, under both test-core and test-tag.

      1. LUCENE-2730.patch
        4 kB
        Michael McCandless

        Activity

        Michael McCandless created issue -
        Hide
        Robert Muir added a comment -

        I've experienced this before too (even a few months ago).

        I'll try to find what rev this was introduced.

        Show
        Robert Muir added a comment - I've experienced this before too (even a few months ago). I'll try to find what rev this was introduced.
        Hide
        Michael McCandless added a comment -

        Attached patch.

        The deadlock in TestIndexWriterExceptions was due to locks being acquired in the wrong order; to fix this I've removed synchronized from TestHash.abort. DocumentsWriter idles all other threads before calling abort so we don't need to sync. I'll also remove this from trunk and 3.x.

        TestAtomicUpdate's deadlock only happens when autoCommit=true, when one thread has called commit and the flush by that commit tries to spawn new merges, which enters CMS who then hits the max merge thread count and waits for merge threads to finish while another merge thread is attempting an autoCommit. I don't think we need to make any changes to 3.x/trunk for this (autoCommit is gone).

        Show
        Michael McCandless added a comment - Attached patch. The deadlock in TestIndexWriterExceptions was due to locks being acquired in the wrong order; to fix this I've removed synchronized from TestHash.abort. DocumentsWriter idles all other threads before calling abort so we don't need to sync. I'll also remove this from trunk and 3.x. TestAtomicUpdate's deadlock only happens when autoCommit=true, when one thread has called commit and the flush by that commit tries to spawn new merges, which enters CMS who then hits the max merge thread count and waits for merge threads to finish while another merge thread is attempting an autoCommit. I don't think we need to make any changes to 3.x/trunk for this (autoCommit is gone).
        Michael McCandless made changes -
        Field Original Value New Value
        Attachment LUCENE-2730.patch [ 12458479 ]
        Hide
        Uwe Schindler added a comment -

        It's there since the very beginning of 2.9/3.0. I had the hangs also in some test runs when releaseing 3.0.0! I mendioned that, but we said "its a test failure, lets ignore it).

        Show
        Uwe Schindler added a comment - It's there since the very beginning of 2.9/3.0. I had the hangs also in some test runs when releaseing 3.0.0! I mendioned that, but we said "its a test failure, lets ignore it).
        Hide
        Uwe Schindler added a comment -

        It happens also in backwards tests (much more often!)

        Show
        Uwe Schindler added a comment - It happens also in backwards tests (much more often!)
        Hide
        Robert Muir added a comment -

        hmm ok, because I'm able to easily reproduce the TestIndexWriterExceptions with 3.0.2,
        but i haven't had any luck with TestAtomicUpdate (seems like a different problem)

        pretty sure TestAtomicUpdate doesn't hang often even in 3.0.x for me now, i'll try on linux
        too just in case.

        Show
        Robert Muir added a comment - hmm ok, because I'm able to easily reproduce the TestIndexWriterExceptions with 3.0.2, but i haven't had any luck with TestAtomicUpdate (seems like a different problem) pretty sure TestAtomicUpdate doesn't hang often even in 3.0.x for me now, i'll try on linux too just in case.
        Hide
        Uwe Schindler added a comment -

        Windows 7 1.5.0_22 64 hangy very often in test-tag (every 2 times)

        Show
        Uwe Schindler added a comment - Windows 7 1.5.0_22 64 hangy very often in test-tag (every 2 times)
        Hide
        Robert Muir added a comment -

        Right, mine hangs very often, but on TestIndexWriterExceptions, not TestAtomicUpdate.

        its the TestAtomicUpdate hang that is hard to reproduce for me

        Show
        Robert Muir added a comment - Right, mine hangs very often, but on TestIndexWriterExceptions, not TestAtomicUpdate. its the TestAtomicUpdate hang that is hard to reproduce for me
        Hide
        Uwe Schindler added a comment -

        TestAtomicUpdate is the one that hangs for me in test-tag every 2 times. Don't know why, maybe the JVM or processor speed?

        Show
        Uwe Schindler added a comment - TestAtomicUpdate is the one that hangs for me in test-tag every 2 times. Don't know why, maybe the JVM or processor speed?
        Hide
        Michael McCandless added a comment -

        Uwe can you test the patch and see if that fixes your TestAtomicUpdate hang?

        Show
        Michael McCandless added a comment - Uwe can you test the patch and see if that fixes your TestAtomicUpdate hang?
        Hide
        Michael McCandless added a comment -

        OK so the TestAtomicUpdate hang only affects 2.9.x, since we removed autoCommit in 3.0.x.

        Show
        Michael McCandless added a comment - OK so the TestAtomicUpdate hang only affects 2.9.x, since we removed autoCommit in 3.0.x.
        Hide
        Robert Muir added a comment -

        TestAtomicUpdate is the one that hangs for me in test-tag every 2 times. Don't know why, maybe the JVM or processor speed?

        No, its because it does not hang in 3.0.x, only 2.9.x, they are not the same bugfix level

        Show
        Robert Muir added a comment - TestAtomicUpdate is the one that hangs for me in test-tag every 2 times. Don't know why, maybe the JVM or processor speed? No, its because it does not hang in 3.0.x, only 2.9.x, they are not the same bugfix level
        Hide
        Uwe Schindler added a comment -

        I have seen the synchronized also in TermsHash in 3.x... Do we need to remove it there, too?

        Show
        Uwe Schindler added a comment - I have seen the synchronized also in TermsHash in 3.x... Do we need to remove it there, too?
        Hide
        Michael McCandless added a comment -

        I have seen the synchronized also in TermsHash in 3.x... Do we need to remove it there, too?

        I will also remove from 3.x & trunk, but, this does not lead to deadlock because we've simplified how IW re-uses RAM, so this particular deadlock case can't occur.

        Still we should remove it, so I will...

        Show
        Michael McCandless added a comment - I have seen the synchronized also in TermsHash in 3.x... Do we need to remove it there, too? I will also remove from 3.x & trunk, but, this does not lead to deadlock because we've simplified how IW re-uses RAM, so this particular deadlock case can't occur. Still we should remove it, so I will...
        Hide
        Robert Muir added a comment -

        I ran with the patch here on 2.9.x, and still encountered the hang in TestAtomicUpdate...

        Show
        Robert Muir added a comment - I ran with the patch here on 2.9.x, and still encountered the hang in TestAtomicUpdate...
        Hide
        Uwe Schindler added a comment -

        For me it now passes

        In my opinion we should remove the stupid synchronized in TermsHash completely (also trunk and 3.x, I removed at least the compact()-sync). This class needs no sync.

        Show
        Uwe Schindler added a comment - For me it now passes In my opinion we should remove the stupid synchronized in TermsHash completely (also trunk and 3.x, I removed at least the compact()-sync). This class needs no sync.
        Hide
        Uwe Schindler added a comment - - edited

        Hangs in this and other tests can also come from incorrect test MT programming. Just think about Michael Busch's talk @ LuceneRev. The test uses some fields in the thread that are not volatile or synchronized on. The test is fixed in trunk/3.x during our total test overhaul. There are about 10 tests using the same wrong thread inner class.

        If we want to fix that we must also fix backwards... happy tag creating again g

        Show
        Uwe Schindler added a comment - - edited Hangs in this and other tests can also come from incorrect test MT programming. Just think about Michael Busch's talk @ LuceneRev. The test uses some fields in the thread that are not volatile or synchronized on. The test is fixed in trunk/3.x during our total test overhaul. There are about 10 tests using the same wrong thread inner class. If we want to fix that we must also fix backwards... happy tag creating again g
        Hide
        Robert Muir added a comment -

        well, it still doesnt fix the problem for me, so i dont think we should go changing things yet, until we find the real cause of the hang.

        Show
        Robert Muir added a comment - well, it still doesnt fix the problem for me, so i dont think we should go changing things yet, until we find the real cause of the hang.
        Hide
        Robert Muir added a comment -

        I made my own test runner, and ran this test for hours, but i couldnt reproduce the hang.

        There is a general problem on windows: if we run a test and it hangs there is no way to get the SIGQUIT stacktrace.
        if you use control-break, it only gives you the stacktrace of the 'ant' jvm itself... somehow I would like to fix this later.

        but don't let me stand in the way here... perhaps i made some mistake/didn't ant clean, something silly! sorry!

        Show
        Robert Muir added a comment - I made my own test runner, and ran this test for hours, but i couldnt reproduce the hang. There is a general problem on windows: if we run a test and it hangs there is no way to get the SIGQUIT stacktrace. if you use control-break, it only gives you the stacktrace of the 'ant' jvm itself... somehow I would like to fix this later. but don't let me stand in the way here... perhaps i made some mistake/didn't ant clean, something silly! sorry!
        Hide
        Michael McCandless added a comment -

        OK I'll commit... we can reopen if the hangs happen again!

        Show
        Michael McCandless added a comment - OK I'll commit... we can reopen if the hangs happen again!
        Michael McCandless committed 1029333 (1 file)
        Reviews: none

        LUCENE-2730: remove sync from TermsHash.abort (causes deadlock on earlier releases)

        Hide
        Michael McCandless added a comment -

        In my opinion we should remove the stupid synchronized in TermsHash completely (also trunk and 3.x, I removed at least the compact()-sync). This class needs no sync.

        We have to be somewhat careful w/ this.

        Eg in 2.9.x/3.0.x some of the sync is needed because IW calls freeRAM "freely", ie, while other threads are indexing. IW still calls freeRAM in 3.x/4.0, but it's a no-op there since we don't recycle RAM as aggressively anymore. So likely we could remove sync from TermsHash on 3.x/trunk, but let's do that another day...

        Show
        Michael McCandless added a comment - In my opinion we should remove the stupid synchronized in TermsHash completely (also trunk and 3.x, I removed at least the compact()-sync). This class needs no sync. We have to be somewhat careful w/ this. Eg in 2.9.x/3.0.x some of the sync is needed because IW calls freeRAM "freely", ie, while other threads are indexing. IW still calls freeRAM in 3.x/4.0, but it's a no-op there since we don't recycle RAM as aggressively anymore. So likely we could remove sync from TermsHash on 3.x/trunk, but let's do that another day...
        Michael McCandless committed 1029348 (1 file)
        Michael McCandless committed 1029367 (29 files)
        Reviews: none

        LUCENE-2730: fix intermittent deadlock cases

        Lucene lucene_3_0
        Michael McCandless committed 1029373 (94 files)
        Reviews: none

        LUCENE-2730: remove sync from TermsHash.abort (causes deadlock on earlier releases)

        Lucene branch_3x
        Michael McCandless made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Fix Version/s 3.1 [ 12314822 ]
        Fix Version/s 4.0 [ 12314025 ]
        Resolution Fixed [ 1 ]
        Uwe Schindler made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Mark Thomas made changes -
        Workflow jira [ 12525607 ] Default workflow, editable Closed status [ 12563783 ]
        Mark Thomas made changes -
        Workflow Default workflow, editable Closed status [ 12563783 ] jira [ 12585323 ]

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development