Solr
  1. Solr
  2. SOLR-7956

There are interrupts on shutdown in places that can cause ChannelAlreadyClosed exceptions which prevents proper closing of transaction logs.

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4, 6.0
    • Component/s: None
    • Labels:
      None

      Description

      Found this while beast testing HttpPartitionTest.

      1. SOLR-7956.patch
        5 kB
        Mark Miller
      2. SOLR-7956.patch
        4 kB
        Mark Miller
      3. SOLR-7956.patch
        3 kB
        Mark Miller
      4. SOLR-7956-commit-tracker.patch
        7 kB
        Scott Blum

        Activity

        Hide
        Mark Miller added a comment -

        I've changed most executors to no longer interrupt on shutdown. Tests work out okay.

        Show
        Mark Miller added a comment - I've changed most executors to no longer interrupt on shutdown. Tests work out okay.
        Hide
        ASF subversion and git services added a comment -

        Commit 1697707 from Mark Miller in branch 'dev/trunk'
        [ https://svn.apache.org/r1697707 ]

        SOLR-7956: There are interrupts on shutdown in places that can cause ChannelAlreadyClosed exceptions which prevents proper closing of transaction logs.

        Show
        ASF subversion and git services added a comment - Commit 1697707 from Mark Miller in branch 'dev/trunk' [ https://svn.apache.org/r1697707 ] SOLR-7956 : There are interrupts on shutdown in places that can cause ChannelAlreadyClosed exceptions which prevents proper closing of transaction logs.
        Hide
        Scott Blum added a comment -

        Interrupts permanently breaking index writers is our #1 production issue right now, so I've just started looking at all the code and JIRAs related to this problem. Glad I found this one, I might try a backport to our ~5.2.1 instance.

        One question, have you also audited the existing Future.cancel(true) calls? I see several of these in Solr. The ones in ReplicationHandler and CommitTracker seem suspect to me, not sure about HttpShardHandler (maybe that one doesn't hit any core stuff directly).

        Show
        Scott Blum added a comment - Interrupts permanently breaking index writers is our #1 production issue right now, so I've just started looking at all the code and JIRAs related to this problem. Glad I found this one, I might try a backport to our ~5.2.1 instance. One question, have you also audited the existing Future.cancel(true) calls? I see several of these in Solr. The ones in ReplicationHandler and CommitTracker seem suspect to me, not sure about HttpShardHandler (maybe that one doesn't hit any core stuff directly).
        Hide
        Yonik Seeley added a comment -

        The ones in ReplicationHandler and CommitTracker seem suspect to me

        I think you're right... the one in CommitTracker does look suspect, esp since we now share the IndexWriter across UpdateHandlers on a core reload.

        Show
        Yonik Seeley added a comment - The ones in ReplicationHandler and CommitTracker seem suspect to me I think you're right... the one in CommitTracker does look suspect, esp since we now share the IndexWriter across UpdateHandlers on a core reload.
        Hide
        Scott Blum added a comment -

        Supposing I make the following adjustments:

        pending.cancel(true) => pending.cancel(false)
        scheduler.shutdownNow() => scheduler.shutdown();

        Then, do I need to wait for the scheduler to actually terminate before exiting close()?

        Show
        Scott Blum added a comment - Supposing I make the following adjustments: pending.cancel(true) => pending.cancel(false) scheduler.shutdownNow() => scheduler.shutdown(); Then, do I need to wait for the scheduler to actually terminate before exiting close()?
        Hide
        Scott Blum added a comment -

        I also see a bunch of calls to ExecutorUtil.shutdownNowAndAwaitTermination(), that seems pretty dangerous. It's difficult to audit exactly how those executors are being used, but it seems worrisome.

        Show
        Scott Blum added a comment - I also see a bunch of calls to ExecutorUtil.shutdownNowAndAwaitTermination(), that seems pretty dangerous. It's difficult to audit exactly how those executors are being used, but it seems worrisome.
        Hide
        Scott Blum added a comment -

        fsyncService.shutdownNow() in IndexFetcher.cleanup() seems particularly dangerous, as that executor's tasks dive directly into channel code

        Show
        Scott Blum added a comment - fsyncService.shutdownNow() in IndexFetcher.cleanup() seems particularly dangerous, as that executor's tasks dive directly into channel code
        Hide
        Mark Miller added a comment -

        Okay, this patch drops use of shutdownNow and cancel(true) - miss anything?

        Show
        Mark Miller added a comment - Okay, this patch drops use of shutdownNow and cancel(true) - miss anything?
        Hide
        Mark Miller added a comment -
               try {
        +        // we allow interrupts because ref counted closing is done - we want out
                 ExecutorUtil.shutdownNowAndAwaitTermination(recoveryExecutor);
               } catch (Exception e) {
        

        Whoops - I'll flip that one too. I think it's fine, but if we can take out all use, it simplifies things a bit.

        Show
        Mark Miller added a comment - try { + // we allow interrupts because ref counted closing is done - we want out ExecutorUtil.shutdownNowAndAwaitTermination(recoveryExecutor); } catch (Exception e) { Whoops - I'll flip that one too. I think it's fine, but if we can take out all use, it simplifies things a bit.
        Hide
        Mark Miller added a comment -

        Hmm...replication will actually hang in some tests when we don't interrupt. I remember this now from the last time I worked on this issue.

        Show
        Mark Miller added a comment - Hmm...replication will actually hang in some tests when we don't interrupt. I remember this now from the last time I worked on this issue.
        Hide
        Mark Miller added a comment -

        Okay, updated patch. Can't wait for shutdown in the replication handler pre close or we get locked up in core reload.

        Show
        Mark Miller added a comment - Okay, updated patch. Can't wait for shutdown in the replication handler pre close or we get locked up in core reload.
        Hide
        Mark Miller added a comment -

        This does slow down the tests a few minutes - though they are already 6 minutes slower than they were for me this spring anyway.

        Show
        Mark Miller added a comment - This does slow down the tests a few minutes - though they are already 6 minutes slower than they were for me this spring anyway.
        Hide
        Scott Blum added a comment -

        Some of those might be okay, I don't know the code well enough. However, the shutdownNow() call in CommitTracker is definitely fatal. I've attached a patch to demonstrate with a test case.

        FWIW, the pending.cancel(true), while dangerous, is far less likely to be hit because the task nulls out the pending field as soon as it starts to run. But the shutdownNow() will interrupt the thread even after the field is nulled out, so it's a lot more likely to really hit.

        Show
        Scott Blum added a comment - Some of those might be okay, I don't know the code well enough. However, the shutdownNow() call in CommitTracker is definitely fatal. I've attached a patch to demonstrate with a test case. FWIW, the pending.cancel(true), while dangerous, is far less likely to be hit because the task nulls out the pending field as soon as it starts to run. But the shutdownNow() will interrupt the thread even after the field is nulled out, so it's a lot more likely to really hit.
        Hide
        Scott Blum added a comment -

        private volatile boolean closed = false;

        ^^ new field is not actually used

        Show
        Scott Blum added a comment - private volatile boolean closed = false; ^^ new field is not actually used
        Hide
        Mark Miller added a comment -

        Speak of the devil, just saw an incident where that commit tracker interrupter is screwing up both Solr and HDFS.

        Show
        Mark Miller added a comment - Speak of the devil, just saw an incident where that commit tracker interrupter is screwing up both Solr and HDFS.
        Hide
        ASF subversion and git services added a comment -

        Commit 1700177 from Mark Miller in branch 'dev/trunk'
        [ https://svn.apache.org/r1700177 ]

        SOLR-7956: There are interrupts on shutdown in places that can cause ChannelAlreadyClosed exceptions which prevents proper closing of transaction logs, interfere with the IndexWriter, the hdfs client and other things.

        Show
        ASF subversion and git services added a comment - Commit 1700177 from Mark Miller in branch 'dev/trunk' [ https://svn.apache.org/r1700177 ] SOLR-7956 : There are interrupts on shutdown in places that can cause ChannelAlreadyClosed exceptions which prevents proper closing of transaction logs, interfere with the IndexWriter, the hdfs client and other things.
        Hide
        ASF subversion and git services added a comment -

        Commit 1700178 from Mark Miller in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1700178 ]

        SOLR-7956: There are interrupts on shutdown in places that can cause ChannelAlreadyClosed exceptions which prevents proper closing of transaction logs.

        Show
        ASF subversion and git services added a comment - Commit 1700178 from Mark Miller in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1700178 ] SOLR-7956 : There are interrupts on shutdown in places that can cause ChannelAlreadyClosed exceptions which prevents proper closing of transaction logs.
        Hide
        ASF subversion and git services added a comment -

        Commit 1700179 from Mark Miller in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1700179 ]

        SOLR-7956: There are interrupts on shutdown in places that can cause ChannelAlreadyClosed exceptions which prevents proper closing of transaction logs, interfere with the IndexWriter, the hdfs client and other things.

        Show
        ASF subversion and git services added a comment - Commit 1700179 from Mark Miller in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1700179 ] SOLR-7956 : There are interrupts on shutdown in places that can cause ChannelAlreadyClosed exceptions which prevents proper closing of transaction logs, interfere with the IndexWriter, the hdfs client and other things.
        Hide
        ASF subversion and git services added a comment -

        Commit 1700185 from Mark Miller in branch 'dev/trunk'
        [ https://svn.apache.org/r1700185 ]

        SOLR-7956: Remove a few more interrupt causes.

        Show
        ASF subversion and git services added a comment - Commit 1700185 from Mark Miller in branch 'dev/trunk' [ https://svn.apache.org/r1700185 ] SOLR-7956 : Remove a few more interrupt causes.
        Hide
        ASF subversion and git services added a comment -

        Commit 1700188 from Mark Miller in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1700188 ]

        SOLR-7956: Remove a few more interrupt causes.

        Show
        ASF subversion and git services added a comment - Commit 1700188 from Mark Miller in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1700188 ] SOLR-7956 : Remove a few more interrupt causes.
        Hide
        ASF subversion and git services added a comment -

        Commit 1700279 from Mark Miller in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1700279 ]

        SOLR-7956: Missed an interrupt shutdown on 5x backport.

        Show
        ASF subversion and git services added a comment - Commit 1700279 from Mark Miller in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1700279 ] SOLR-7956 : Missed an interrupt shutdown on 5x backport.
        Hide
        Scott Blum added a comment -

        Mark Miller LGTM, thanks!

        One question about:

        https://svn.apache.org/viewvc/lucene/dev/branches/branch_5x/solr/core/src/java/org/apache/solr/update/CommitTracker.java?r1=1700188&r2=1700187&pathrev=1700188

        I don't remember why anymore, but I may have needed to use `ExecutorUtil.shutdownAndAwaitTermination(scheduler);` locally when I was testing this. Curious if you have concrete reasons for preferring one or the other here.

        Show
        Scott Blum added a comment - Mark Miller LGTM, thanks! One question about: https://svn.apache.org/viewvc/lucene/dev/branches/branch_5x/solr/core/src/java/org/apache/solr/update/CommitTracker.java?r1=1700188&r2=1700187&pathrev=1700188 I don't remember why anymore, but I may have needed to use `ExecutorUtil.shutdownAndAwaitTermination(scheduler);` locally when I was testing this. Curious if you have concrete reasons for preferring one or the other here.
        Hide
        Mark Miller added a comment -

        No strong reason - just that we were not waiting to begin with, so I figured I wouldn't change that, just remove the interrupts. Tests did not seem to complain, so stuck with it.

        Show
        Mark Miller added a comment - No strong reason - just that we were not waiting to begin with, so I figured I wouldn't change that, just remove the interrupts. Tests did not seem to complain, so stuck with it.
        Hide
        ASF subversion and git services added a comment -

        Commit 1707066 from Mark Miller in branch 'dev/trunk'
        [ https://svn.apache.org/r1707066 ]

        SOLR-7956: Fix CHANGES entries that got mangled.

        Show
        ASF subversion and git services added a comment - Commit 1707066 from Mark Miller in branch 'dev/trunk' [ https://svn.apache.org/r1707066 ] SOLR-7956 : Fix CHANGES entries that got mangled.

          People

          • Assignee:
            Mark Miller
            Reporter:
            Mark Miller
          • Votes:
            1 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development