Solr / SOLR-9556

Overseer can leak threads if it starts up while its parent container is shutting down

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.3
    • Component/s: None
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None

      Description

      See https://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Solaris/407/consoleFull for an example. OverseerAutoReplicaFailoverThread seems particularly susceptible to this, for reasons that aren't yet clear.

      1. SOLR-9556.patch
        0.6 kB
        Alan Woodward

        Activity

        romseygeek Alan Woodward added a comment -

        I've stepped through the code and I can't work out where the leak actually happens, but it seems that if the Overseer thread is started by its ElectionContext after everything else is shut down, then Overseer.close() doesn't get called. I think there are a couple of things we can do here:
        1) Close the election threads right at the beginning of CoreContainer shutdown. This should help prevent spurious leader elections on closing nodes.
        2) Always quit Overseer threads on interrupt. At the moment they check to see if they're closed first, but is there really any situation in which a thread is interrupted but it shouldn't then exit?
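
The difference between the two interrupt-handling styles in point 2 can be sketched in isolation. This is a minimal illustration, not the actual Overseer code: `Worker`, `InterruptExitSketch`, and `leaksAfterInterrupt` are hypothetical names, and `Thread.sleep` stands in for whatever periodic work the real thread does. It assumes the leak scenario is "thread gets interrupted, but close() is never called".

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class InterruptExitSketch {

  /** Hypothetical stand-in for an Overseer-style background worker. */
  static class Worker implements Runnable {
    private final AtomicBoolean closed = new AtomicBoolean(false);
    private final boolean exitOnInterrupt;

    Worker(boolean exitOnInterrupt) {
      this.exitOnInterrupt = exitOnInterrupt;
    }

    void close() {
      closed.set(true);
    }

    @Override
    public void run() {
      while (!closed.get()) {
        try {
          Thread.sleep(50); // stands in for one round of periodic work
        } catch (InterruptedException e) {
          if (exitOnInterrupt || closed.get()) {
            Thread.currentThread().interrupt(); // restore the interrupt status
            return;
          }
          // "Check closed first" style: interrupted but not closed, so keep looping.
        }
      }
    }
  }

  /** Interrupts a worker without ever calling close() first; returns true if it is still alive afterwards. */
  static boolean leaksAfterInterrupt(boolean exitOnInterrupt) {
    Worker worker = new Worker(exitOnInterrupt);
    Thread t = new Thread(worker, "worker-sketch");
    t.setDaemon(true);
    t.start();
    t.interrupt();
    try {
      t.join(1000); // give the thread a generous window to exit
    } catch (InterruptedException e) {
      throw new IllegalStateException(e);
    }
    boolean alive = t.isAlive();
    worker.close(); // clean up so the sketch itself does not leak a thread
    return alive;
  }

  public static void main(String[] args) {
    System.out.println("closed-check-first leaks: " + leaksAfterInterrupt(false));
    System.out.println("exit-on-interrupt leaks: " + leaksAfterInterrupt(true));
  }
}
```

In this sketch the worker that only honours an interrupt when it is already closed keeps running after `interrupt()`, while the worker that always exits on interrupt terminates even though `close()` was never called.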

        romseygeek Alan Woodward added a comment -

        Another test failure caused by this: https://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Linux/1827/

        romseygeek Alan Woodward added a comment -

        Simple fix, just ensuring that OverseerAutoReplicaFailoverThread exits if it's interrupted.

        I experimented with closing the Overseer earlier in shutdown, but that doesn't really work because we always need an Overseer running to publish DOWN state. So the final container needs to take over the overseer role just so that it can shut down cleanly.

        jira-bot ASF subversion and git services added a comment -

        Commit 1f0d75b802f2968703692fe4b8c82b70ba851cea in lucene-solr's branch refs/heads/branch_6x from Alan Woodward
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1f0d75b ]

        SOLR-9556: Exit failover thread on interrupt

        jira-bot ASF subversion and git services added a comment -

        Commit ef747c8445a5e3d698f7f02777c528883351f293 in lucene-solr's branch refs/heads/master from Alan Woodward
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ef747c8 ]

        SOLR-9556: Exit failover thread on interrupt

        shalinmangar Shalin Shekhar Mangar added a comment -

        Closing after 6.3.0 release.


          People

          • Assignee:
            Unassigned
            Reporter:
            romseygeek Alan Woodward
          • Votes:
            0
            Watchers:
            5