Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.2
    • Fix Version/s: 6.3, master (7.0)
    • Component/s: Server
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
    • Environment:

      Linux

    • Flags:
      Patch

      Description

      There is a bug in IndexFetcher for replication logic, it may cause deadlock issue, and it's very easy to reproduce. If you change your solrconfig to keep more than 1 commit points, this operation will causes 2 issues:
      1. Slave has to download whole index directory of Master, instead of incremental udpates only;
      2. If you click "replicate now" button manually, this is cause deadlock, stop both "indexFetcher" thread and "explicitFetcher" thread.

      The first issue is a design issue, can be worked around by keep only 1 commit point. But the second issue can always happen if there is some file located in slave's index directory, but can not be deleted by index delete policy (due to permission issue etc), I have fixed this issue for my service, would happy to contribute to Solr community to benefit others.

        Issue Links

          Activity

          Hide
          erickerickson Erick Erickson added a comment -

          Xunlong:

          It would certainly be appreciated if you went ahead and attached a patch, and note that patches are easiest to work with. Here's a bit of documentation on patches in case you are unfamiliar with them:

          https://wiki.apache.org/solr/HowToContribute#Working_With_Patches

          Thanks for reporting this!

          Erick

          Show
          erickerickson Erick Erickson added a comment - Xunlong: It would certainly be appreciated if you went ahead and attached a patch, and note that patches are easiest to work with. Here's a bit of documentation on patches in case you are unfamiliar with them: https://wiki.apache.org/solr/HowToContribute#Working_With_Patches Thanks for reporting this! Erick
          Hide
          gui.xunlong@gmail.com Xunlong added a comment -

          This can be fixed by code change:

          f45c89a52c67% git diff
          diff --git a/solr/core/src/java/org/apache/solr/handler/IndexFetcher.java b/solr/core/src/java/org/apache/solr/handler/IndexFetcher.java
          index 0244844..99f7124 100644
          — a/solr/core/src/java/org/apache/solr/handler/IndexFetcher.java
          +++ b/solr/core/src/java/org/apache/solr/handler/IndexFetcher.java
          @@ -416,11 +416,22 @@ public class IndexFetcher {
          } finally

          { writer.decref(); }
          • solrCore.getUpdateHandler().getSolrCoreState().closeIndexWriter(solrCore, true);
            +
            + // Following line is in original Solr 5.4 code base, but could possibly cause dead lock issue
            + //solrCore.getUpdateHandler().getSolrCoreState().closeIndexWriter(solrCore, true);
            }
            boolean reloadCore = false;

          try {
          + // start to download index files
          + // need to close current index writer if we are just doing incremental index files download
          + // notice this operation will occupy indexWriterLock in DefaultSolrCoreState which has to be released
          + // later by calling openIndexWriter(), otherwise, we might cause a deadlock once explicit-replication thread
          + // jump in
          + if (!isFullCopyNeeded)

          { + solrCore.getUpdateHandler().getSolrCoreState().closeIndexWriter(solrCore, true); + }

          +
          LOG.info("Starting download (fullCopy={}) to {}", isFullCopyNeeded, tmpIndexDir);
          successfulInstall = false;

          @@ -470,6 +481,8 @@ public class IndexFetcher {
          }
          }
          } finally {
          + // open a new index writer, this will also release indexWriterLock in DefaultSolrCoreState to
          + // allow other thread (like explicit-replication thread) do replication job
          if (!isFullCopyNeeded)

          { solrCore.getUpdateHandler().getSolrCoreState().openIndexWriter(solrCore); }

          The reason is that we change boolean value "isFullCopyNeeded" from "false" to "true" if there are files can not be deleted by IndexDeletionPolicy, but we still close indexWriter to occupy index writer lock because we thought "isFullCopyNeeded==false", but actually "isFullCopyNeeded==true" right now. Due to "isFullCopyNeeded==true", the following logic can not create a new indexWriter to release index writer lock, then this thread will hold the index writer lock forever. If another replication thread jump in, which will acquire "indexFetchLock" in ReplicationHandler, deadlock will happen.

          Show
          gui.xunlong@gmail.com Xunlong added a comment - This can be fixed by code change: f45c89a52c67% git diff diff --git a/solr/core/src/java/org/apache/solr/handler/IndexFetcher.java b/solr/core/src/java/org/apache/solr/handler/IndexFetcher.java index 0244844..99f7124 100644 — a/solr/core/src/java/org/apache/solr/handler/IndexFetcher.java +++ b/solr/core/src/java/org/apache/solr/handler/IndexFetcher.java @@ -416,11 +416,22 @@ public class IndexFetcher { } finally { writer.decref(); } solrCore.getUpdateHandler().getSolrCoreState().closeIndexWriter(solrCore, true); + + // Following line is in original Solr 5.4 code base, but could possibly cause dead lock issue + //solrCore.getUpdateHandler().getSolrCoreState().closeIndexWriter(solrCore, true); } boolean reloadCore = false; try { + // start to download index files + // need to close current index writer if we are just doing incremental index files download + // notice this operation will occupy indexWriterLock in DefaultSolrCoreState which has to be released + // later by calling openIndexWriter(), otherwise, we might cause a deadlock once explicit-replication thread + // jump in + if (!isFullCopyNeeded) { + solrCore.getUpdateHandler().getSolrCoreState().closeIndexWriter(solrCore, true); + } + LOG.info("Starting download (fullCopy={}) to {}", isFullCopyNeeded, tmpIndexDir); successfulInstall = false; @@ -470,6 +481,8 @@ public class IndexFetcher { } } } finally { + // open a new index writer, this will also release indexWriterLock in DefaultSolrCoreState to + // allow other thread (like explicit-replication thread) do replication job if (!isFullCopyNeeded) { solrCore.getUpdateHandler().getSolrCoreState().openIndexWriter(solrCore); } The reason is that we change boolean value "isFullCopyNeeded" from "false" to "true" if there are files can not be deleted by IndexDeletionPolicy, but we still close indexWriter to occupy index writer lock because we thought "isFullCopyNeeded==false", but actually "isFullCopyNeeded==true" right now. Due to "isFullCopyNeeded==true", the following logic can not create a new indexWriter to release index writer lock, then this thread will hold the index writer lock forever. If another replication thread jump in, which will acquire "indexFetchLock" in ReplicationHandler, deadlock will happen.
          Hide
          gui.xunlong@gmail.com Xunlong added a comment -

          Patch

          Show
          gui.xunlong@gmail.com Xunlong added a comment - Patch
          Hide
          gui.xunlong@gmail.com Xunlong added a comment -

          Hi Erick,

          Patch is attached. Thanks.

          Show
          gui.xunlong@gmail.com Xunlong added a comment - Hi Erick, Patch is attached. Thanks.
          Hide
          mbraun688 Michael Braun added a comment -

          Is this possibly the same issue I reported as SOLR-9470 or another deadlock?

          Show
          mbraun688 Michael Braun added a comment - Is this possibly the same issue I reported as SOLR-9470 or another deadlock?
          Hide
          mbraun688 Michael Braun added a comment -

          We are running this patch on top of 6.2.1 on our cluster and have not run into the issue in SOLR-9470 again. Is someone able to take a look at it to see if it can fit into the regular release?

          Show
          mbraun688 Michael Braun added a comment - We are running this patch on top of 6.2.1 on our cluster and have not run into the issue in SOLR-9470 again. Is someone able to take a look at it to see if it can fit into the regular release?
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 05f51c85f6a536915bdfad5959966c891fac6ee0 in lucene-solr's branch refs/heads/master from markrmiller
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=05f51c8 ]

          SOLR-9278: Update issue number in CHANGES to be orig rather than dupe.

          Show
          jira-bot ASF subversion and git services added a comment - Commit 05f51c85f6a536915bdfad5959966c891fac6ee0 in lucene-solr's branch refs/heads/master from markrmiller [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=05f51c8 ] SOLR-9278 : Update issue number in CHANGES to be orig rather than dupe.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 360ac3ad26d71fedcba2550271ee4af3dc93651f in lucene-solr's branch refs/heads/branch_6x from markrmiller
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=360ac3a ]

          SOLR-9278: Update issue number in CHANGES to be orig rather than dupe.

          Show
          jira-bot ASF subversion and git services added a comment - Commit 360ac3ad26d71fedcba2550271ee4af3dc93651f in lucene-solr's branch refs/heads/branch_6x from markrmiller [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=360ac3a ] SOLR-9278 : Update issue number in CHANGES to be orig rather than dupe.
          Hide
          markrmiller@gmail.com Mark Miller added a comment -

          Thanks Xunlong!

          Show
          markrmiller@gmail.com Mark Miller added a comment - Thanks Xunlong!
          Hide
          markrmiller@gmail.com Mark Miller added a comment -

          Commits are accidently logged in the dupe issue FYI: SOLR-9470

          Commit ce22c2697c1342fd670e3bac460a53aef90d1d80 in lucene-solr's branch refs/heads/master from markrmiller
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ce22c26 ]
          SOLR-9470: Index replication interactions with IndexWriter can cause deadlock.

          Commit 82440f307a211d8f05fe729ec1361bcd1abd0e4e in lucene-solr's branch refs/heads/branch_6x from markrmiller
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=82440f3 ]
          SOLR-9470: Index replication interactions with IndexWriter can cause deadlock.

          Show
          markrmiller@gmail.com Mark Miller added a comment - Commits are accidently logged in the dupe issue FYI: SOLR-9470 Commit ce22c2697c1342fd670e3bac460a53aef90d1d80 in lucene-solr's branch refs/heads/master from markrmiller [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ce22c26 ] SOLR-9470 : Index replication interactions with IndexWriter can cause deadlock. Commit 82440f307a211d8f05fe729ec1361bcd1abd0e4e in lucene-solr's branch refs/heads/branch_6x from markrmiller [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=82440f3 ] SOLR-9470 : Index replication interactions with IndexWriter can cause deadlock.
          Hide
          shalinmangar Shalin Shekhar Mangar added a comment -

          Closing after 6.3.0 release.

          Show
          shalinmangar Shalin Shekhar Mangar added a comment - Closing after 6.3.0 release.

            People

            • Assignee:
              markrmiller@gmail.com Mark Miller
              Reporter:
              gui.xunlong@gmail.com Xunlong
            • Votes:
              2 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated:
                Original Estimate - 48h
                48h
                Remaining:
                Remaining Estimate - 48h
                48h
                Logged:
                Time Spent - Not Specified
                Not Specified

                  Development