Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      found by http://jenkins.sd-datasolutions.de/job/Lucene-Solr-4.x-Windows-Java7-64/70/

      rollback should never throw this exception, as it documents it clears any pendingcommits.

      but it calls closeInternal outside of any sync block, so it looks like there is a race here.

      1. deadlock.log
        15 kB
        Michael McCandless
      2. fail.log
        2.91 MB
        Michael McCandless
      3. fail.log
        20 kB
        Michael McCandless
      4. LUCENE-4147.patch
        19 kB
        Simon Willnauer
      5. LUCENE-4147.patch
        20 kB
        Simon Willnauer
      6. LUCENE-4147.patch
        19 kB
        Simon Willnauer
      7. LUCENE-4147.patch
        17 kB
        Michael McCandless
      8. LUCENE-4147.patch
        6 kB
        Michael McCandless

        Activity

        Hide
        Michael McCandless added a comment -

        Test case that fails more easily from the bug ...

        Show
        Michael McCandless added a comment - Test case that fails more easily from the bug ...
        Hide
        Michael McCandless added a comment -

        Patch, acquiring the commitLock around close and rollback, and adding
        the ensureOpen before prepareCommit. However, the test still fails
        after a few hundred beasting iterations, I think because of thread safety
        issues where one thread calls docWriter.abort while another is
        flushing ... not sure how to fix that one yet. Simon maybe you can
        have a look?

        Show
        Michael McCandless added a comment - Patch, acquiring the commitLock around close and rollback, and adding the ensureOpen before prepareCommit. However, the test still fails after a few hundred beasting iterations, I think because of thread safety issues where one thread calls docWriter.abort while another is flushing ... not sure how to fix that one yet. Simon maybe you can have a look?
        Hide
        Simon Willnauer added a comment -

        mike, do we really need to acquire the commit lock? from my perspective it would be enough to add an ensure open when we assigne pendingCommit (inside the sync block) so that racing threads hit already close exceptions.

        regarding the thread safety issue in DocWriter can you paste the trace?

        Show
        Simon Willnauer added a comment - mike, do we really need to acquire the commit lock? from my perspective it would be enough to add an ensure open when we assigne pendingCommit (inside the sync block) so that racing threads hit already close exceptions. regarding the thread safety issue in DocWriter can you paste the trace?
        Hide
        Michael McCandless added a comment -

        mike, do we really need to acquire the commit lock?

        The problem is rollback forcefully clears the pendingCommit and then deletes any files it had (alone) referenced, so if a commit is running concurrently the fsyncs will fail since the files were deleted.

        Also: it doesn't really make sense to allow rollback and commit to proceed concurrently? Why would an app need this? Seems like we can simplify the code by making them exclusive.

        regarding the thread safety issue in DocWriter can you paste the trace?

        Will do ... need to re-beast.

        Show
        Michael McCandless added a comment - mike, do we really need to acquire the commit lock? The problem is rollback forcefully clears the pendingCommit and then deletes any files it had (alone) referenced, so if a commit is running concurrently the fsyncs will fail since the files were deleted. Also: it doesn't really make sense to allow rollback and commit to proceed concurrently? Why would an app need this? Seems like we can simplify the code by making them exclusive. regarding the thread safety issue in DocWriter can you paste the trace? Will do ... need to re-beast.
        Hide
        Michael McCandless added a comment -

        Here's the verbose output from a failure w/ the patch ...

        Show
        Michael McCandless added a comment - Here's the verbose output from a failure w/ the patch ...
        Hide
        Michael McCandless added a comment -

        I think we should commit the current patch, and then leave the issue open for the docWriter abort/flush thread safety.

        The current patch should fix the Jenkins test failures we're seeing (but the new test here may still sometimes fail until we fix the abort/flush thread safety issue).

        Show
        Michael McCandless added a comment - I think we should commit the current patch, and then leave the issue open for the docWriter abort/flush thread safety. The current patch should fix the Jenkins test failures we're seeing (but the new test here may still sometimes fail until we fix the abort/flush thread safety issue).
        Hide
        Simon Willnauer added a comment -

        patch fixing the DWPT issue. The problem was that we didn't close the DW before aborting. That means we didn't invalidate the thread states in DWPTThreadPool and an already waiting Thread could acquire the state before we eventually close the DW. If that happens together with a low ram buffer / low maxBufferedDocs we hit an exception on flush since IFD deleted the files already. Now since we first close and then abort this can't happen anymore and will cause an AlreadClosedException for the indexing thread.

        Show
        Simon Willnauer added a comment - patch fixing the DWPT issue. The problem was that we didn't close the DW before aborting. That means we didn't invalidate the thread states in DWPTThreadPool and an already waiting Thread could acquire the state before we eventually close the DW. If that happens together with a low ram buffer / low maxBufferedDocs we hit an exception on flush since IFD deleted the files already. Now since we first close and then abort this can't happen anymore and will cause an AlreadClosedException for the indexing thread.
        Hide
        Simon Willnauer added a comment -

        one more think, I think this is a general problem that exists before so we might need a CHANGES.TXT entry?

        Show
        Simon Willnauer added a comment - one more think, I think this is a general problem that exists before so we might need a CHANGES.TXT entry?
        Hide
        Michael McCandless added a comment -

        The problem was that we didn't close the DW before aborting.

        Aha! Thanks.

        I'll beast this, and add a CHANGES entry ...

        Show
        Michael McCandless added a comment - The problem was that we didn't close the DW before aborting. Aha! Thanks. I'll beast this, and add a CHANGES entry ...
        Hide
        Michael McCandless added a comment -

        Hmm, hit a failure (verbose log attached)... haven't tried to understand it yet.

        Show
        Michael McCandless added a comment - Hmm, hit a failure (verbose log attached)... haven't tried to understand it yet.
        Hide
        Simon Willnauer added a comment -

        I see what's happening. There is a thread that started flushing before we call rollback but finishes after we already wiped its files. I think we don't have a choice here but wait for the flushes to finish with flushControl.waitForFlush(); I will prepare a new patch tomorrow.

        Show
        Simon Willnauer added a comment - I see what's happening. There is a thread that started flushing before we call rollback but finishes after we already wiped its files. I think we don't have a choice here but wait for the flushes to finish with flushControl.waitForFlush(); I will prepare a new patch tomorrow.
        Hide
        Simon Willnauer added a comment -

        new patch that waits for running flushes in abort after all possible DWPTs are aborted.

        Show
        Simon Willnauer added a comment - new patch that waits for running flushes in abort after all possible DWPTs are aborted.
        Hide
        Michael McCandless added a comment -

        Thanks Simon; I'll re-beast.

        Show
        Michael McCandless added a comment - Thanks Simon; I'll re-beast.
        Hide
        Michael McCandless added a comment -

        Well, the good news is beasting didn't uncover a failure ... but the bad news is: it uncovered a deadlock/hang!! I'm attaching thread stacks.

        Show
        Michael McCandless added a comment - Well, the good news is beasting didn't uncover a failure ... but the bad news is: it uncovered a deadlock/hang!! I'm attaching thread stacks.
        Hide
        Simon Willnauer added a comment -

        I feared that this is gonna hang at some point. I moved the docWriter abort / close out of the sync block in IW rollbackInternal and beasted the new test + all other tests for hours now. I think this is fine to move that out, no need really to keep the IW lock since we already have the commit lock in our hands. I didn't see a failure so far.

        Show
        Simon Willnauer added a comment - I feared that this is gonna hang at some point. I moved the docWriter abort / close out of the sync block in IW rollbackInternal and beasted the new test + all other tests for hours now. I think this is fine to move that out, no need really to keep the IW lock since we already have the commit lock in our hands. I didn't see a failure so far.
        Hide
        Michael McCandless added a comment -

        Thanks Simon, I'll review & beast ...

        Show
        Michael McCandless added a comment - Thanks Simon, I'll review & beast ...
        Hide
        Michael McCandless added a comment -

        OK, patch looks good, and beasting ran for 4 hours w/ no failures/hangs ... I'll commit. Thanks Simon!

        Show
        Michael McCandless added a comment - OK, patch looks good, and beasting ran for 4 hours w/ no failures/hangs ... I'll commit. Thanks Simon!

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development