  1. Solr
  2. SOLR-9639

CdcrVersionReplicationTest failure

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.3, master (7.0)
    • Component/s: None
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels: None

      Description

      CdcrVersionReplicationTest fails.

      The problem occurs when the test deletes the temporary collection (which is a tricky thing in itself) while it is still in recovery. SolrCloud then goes haywire: it closes the core and is almost done, but the core can't be unloaded because PeerSync (remember, it's recovering) has opened it once, and it bloats the logs with

      105902 INFO (qtp3284815-656) [n:127.0.0.1:41440_ia%2Fd ] o.a.s.c.SolrCore Core collection1 is not yet closed, waiting 100 ms before checking again.

      But then something spawns too many /get requests(?), which deadlock until the heap is exhausted and the process dies. The fix is obvious: just wait until the recoveries finish before removing tmp_collection.
      Besides this particular fix, are there any ideas about the deadlock caused by deleting a recovering collection?
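
The wait-then-delete ordering described above can be sketched roughly as follows. This is a generic illustration under stated assumptions, not the attached patch: `WaitBeforeDelete`, `isRecovering`, and `deleteCollection` are hypothetical names standing in for whatever the test uses to check recovery state and to remove tmp_collection.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper: none of these names come from Solr or from the patch.
final class WaitBeforeDelete {

    /**
     * Polls isRecovering until it reports false or timeoutMs elapses.
     * Returns true if recovery finished, false on timeout or interruption.
     */
    static boolean waitUntilRecovered(BooleanSupplier isRecovering, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (isRecovering.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // gave up: the collection is still recovering
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return true;
    }

    /**
     * Runs deleteCollection only after recovery has finished.
     * Returns true if the delete was issued, false if the wait gave up.
     */
    static boolean deleteWhenRecovered(BooleanSupplier isRecovering, Runnable deleteCollection,
                                       long timeoutMs, long pollMs) {
        if (!waitUntilRecovered(isRecovering, timeoutMs, pollMs)) {
            return false; // never delete a collection that may still be recovering
        }
        deleteCollection.run();
        return true;
    }
}
```

The point of the ordering is that the delete is never issued while a recovery may still hold the core open; if the wait times out, the sketch skips the delete rather than risk the situation described above.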

      1. CDcr failure.txt
        10.63 MB
        Mikhail Khludnev
      2. cdcr-stack.txt
        16.08 MB
        Mikhail Khludnev
      3. cdcr-success.txt
        35 kB
        Mikhail Khludnev
      4. SOLR-9639.patch
        1 kB
        Mikhail Khludnev
      5. SOLR-9639.patch
        0.6 kB
        Mikhail Khludnev

          Activity

          mkhludnev Mikhail Khludnev added a comment -

          Attaching the failure log, a stack trace, and the head of a successful execution.
          ant test -Dtestcase=CdcrVersionReplicationTest -Dtests.seed=374BB442DF231F4F -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=tr -Dtests.timezone=Africa/Tunis -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1

          romseygeek Alan Woodward added a comment -

          I think we need a way to cancel or interrupt PeerSync from RecoveryStrategy?

          mkhludnev Mikhail Khludnev added a comment -

          Don't you want to fix CI first with this one-liner, and then implement a reasonable way to break out of recovery?

          romseygeek Alan Woodward added a comment -

          +1 to the quick fix, let's open a separate JIRA for the cancellation.

          mkhludnev Mikhail Khludnev added a comment -

          After applying this, the Solr tests run much more smoothly on my machine. Launching precommit.

          jira-bot ASF subversion and git services added a comment -

          Commit 47446733884e030feaecac355c01c58f9e5e3169 in lucene-solr's branch refs/heads/master from Mikhail Khludnev
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4744673 ]

          SOLR-9639: CDCR Tests only fix. Wait until recovery is over before
          remove the tmp_colletion.

          jira-bot ASF subversion and git services added a comment -

          Commit 96e0c2ff48cf70f9c376760e50b78281699d0e53 in lucene-solr's branch refs/heads/branch_6x from Mikhail Khludnev
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=96e0c2f ]

          SOLR-9639: CDCR Tests only fix. Wait until recovery is over before
          remove the tmp_colletion.

          mkhludnev Mikhail Khludnev added a comment -

          Follow-up: SOLR-9645

          shalinmangar Shalin Shekhar Mangar added a comment -

          Closing after 6.3.0 release.


            People

            • Assignee: Unassigned
            • Reporter: mkhludnev Mikhail Khludnev
            • Votes: 0
            • Watchers: 5
