Solr
  1. Solr
  2. SOLR-5397

Replication can fail silently in some cases.

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.6, Trunk
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      Error Message:
      shard3 is not consistent. Got 137 from http://127.0.0.1:49317/collection1lastClient and got 125 from http://127.0.0.1:39794/collection1

      1. SOLR-5397.patch
        10 kB
        Mark Miller
      2. monkeyrun.log
        2.38 MB
        Mark Miller

        Issue Links

          Activity

          Hide
          Mark Miller added a comment -

          Seems to happen on runs when we don't even kill any servers or cause any connection disruption, etc.

          Show
          Mark Miller added a comment - Seems to happen on runs when we don't even kill any servers or cause any connection disruption, etc.
          Hide
          ASF subversion and git services added a comment -

          Commit 1536499 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1536499 ]

          SOLR-5397: Move the call to obtain a ConcucrrentSolrServer into try catch block

          Show
          ASF subversion and git services added a comment - Commit 1536499 from Mark Miller in branch 'dev/trunk' [ https://svn.apache.org/r1536499 ] SOLR-5397 : Move the call to obtain a ConcucrrentSolrServer into try catch block
          Hide
          ASF subversion and git services added a comment -

          Commit 1536500 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1536500 ]

          SOLR-5397: Move the call to obtain a ConcucrrentSolrServer into try catch block

          Show
          ASF subversion and git services added a comment - Commit 1536500 from Mark Miller in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1536500 ] SOLR-5397 : Move the call to obtain a ConcucrrentSolrServer into try catch block
          Hide
          ASF subversion and git services added a comment -

          Commit 1536511 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1536511 ]

          SOLR-5397: If a retry fails, always finish the rest of the retries

          Show
          ASF subversion and git services added a comment - Commit 1536511 from Mark Miller in branch 'dev/trunk' [ https://svn.apache.org/r1536511 ] SOLR-5397 : If a retry fails, always finish the rest of the retries
          Hide
          ASF subversion and git services added a comment -

          Commit 1536513 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1536513 ]

          SOLR-5397: If a retry fails, always finish the rest of the retries

          Show
          ASF subversion and git services added a comment - Commit 1536513 from Mark Miller in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1536513 ] SOLR-5397 : If a retry fails, always finish the rest of the retries
          Hide
          Mark Miller added a comment -

          This is a relatively shorter run with the fail. No disruption is occurring in this run due to SOLR-5417.

          Show
          Mark Miller added a comment - This is a relatively shorter run with the fail. No disruption is occurring in this run due to SOLR-5417 .
          Hide
          Mark Miller added a comment -

          I think I might be making some headway on this...will know more tomorrow.

          Show
          Mark Miller added a comment - I think I might be making some headway on this...will know more tomorrow.
          Hide
          Mark Miller added a comment -

          So I don't yet know 100% what this is, but I have some indications and my current guess is this is actually existing, just wast easily caught in a test until we moved from batching to streaming updates.

          Show
          Mark Miller added a comment - So I don't yet know 100% what this is, but I have some indications and my current guess is this is actually existing, just wast easily caught in a test until we moved from batching to streaming updates.
          Hide
          Mark Miller added a comment -

          Getting fairly confident this is due to replication - it's not replicating when it should - I think there might be a recent bug there. I seem to be able to work around it by going back to the old style of forcing a replication no matter what.

          Show
          Mark Miller added a comment - Getting fairly confident this is due to replication - it's not replicating when it should - I think there might be a recent bug there. I seem to be able to work around it by going back to the old style of forcing a replication no matter what.
          Hide
          Mark Miller added a comment -

          Well, I'd like to get this in for 4.6, but I'm taking a break and it digging into replication issues is a bunch more slog work.

          A tradeoff would be to change 4.6 back to replicating all the time - some cases painful, but always correct.

          Show
          Mark Miller added a comment - Well, I'd like to get this in for 4.6, but I'm taking a break and it digging into replication issues is a bunch more slog work. A tradeoff would be to change 4.6 back to replicating all the time - some cases painful, but always correct.
          Hide
          Mark Miller added a comment -

          Oh good...this might not be such a deep dive needed...

          Show
          Mark Miller added a comment - Oh good...this might not be such a deep dive needed...
          Hide
          Mark Miller added a comment -

          Strange note: all of the cases that the replication does not happen correctly seems to involve only replicating .del files as new files.

          Show
          Mark Miller added a comment - Strange note: all of the cases that the replication does not happen correctly seems to involve only replicating .del files as new files.
          Hide
          Mark Miller added a comment -

          Starting to look very similar to SOLR-4926

          Show
          Mark Miller added a comment - Starting to look very similar to SOLR-4926
          Hide
          Mark Miller added a comment -

          Have things fairly isolated now. Pretty odd - we skip downloading all these index files to what should be a new tmp dir, but the claim is they already exist - that doesn't seem to hold up because when we try and open on the resulting index (of which we only copied del files and a new segments_n file into) we don't find one of the skipped index files and so it fails back to the previous gen, which is empty. Weird, but should be able to get to the bottom of it.

          org.apache.lucene.store.MMapDirectory@/ext3space/jenkins/workspace/ChaosMonkeySafeLeader/solr/build/solr-core/test/J0/org.apache.solr.cloud.ChaosMonkeySafeLeaderTest-1384726566279/jetty10/index.80120827121941846 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@73c8693 fullCopy=false
             [junit4]   2> 76387 T210 C18 P57580 oash.SnapPuller.downloadIndexFiles Skipping download for _2.si because it already exists
             [junit4]   2> 76387 T210 C18 P57580 oash.SnapPuller.downloadIndexFiles Skipping download for _0.cfe because it already exists
             [junit4]   2> 76403 T210 C18 P57580 oash.SnapPuller.downloadIndexFiles Skipping download for _2.cfs because it already exists
             [junit4]   2> 76404 T210 C18 P57580 oash.SnapPuller.downloadIndexFiles Skipping download for _1.cfe because it already exists
             [junit4]   2> 76406 T210 C18 P57580 oash.SnapPuller.downloadIndexFiles Skipping download for _0.cfs because it already exists
             [junit4]   2> 76407 T210 C18 P57580 oash.SnapPuller.downloadIndexFiles Skipping download for _1.si because it already exists
             [junit4]   2> 76411 T210 C18 P57580 oash.SnapPuller.downloadIndexFiles Skipping download for _0.si because it already exists
             [junit4]   2> 76412 T210 C18 P57580 oash.SnapPuller.downloadIndexFiles Skipping download for _1.cfs because it already exists
             [junit4]   2> 76419 T210 C18 P57580 oash.SnapPuller.downloadIndexFiles Skipping download for _2_1.del because it already exists
             [junit4]   2> 76420 T210 C18 P57580 oash.SnapPuller.downloadIndexFiles Skipping download for _2.cfe because it already exists
             [junit4]   2> 76420 T210 C18 P57580 oash.SnapPuller.fetchLatestIndex Total time taken for download : 0 secs
             [junit4]   2> 76756 T210 C18 P57580 oash.SnapPuller.moveAFile DEBUG Moving file: _1_1.del
             [junit4]   2> 76757 T210 C18 P57580 oash.SnapPuller.moveAFile DEBUG Moving file: _0_2.del
             [junit4]   2> 76758 T210 C18 P57580 oash.SnapPuller.moveAFile DEBUG Moving file: segments_2
             [junit4]   2> java.io.FileNotFoundException: /ext3space/jenkins/workspace/ChaosMonkeySafeLeader/solr/build/solr-core/test/J0/org.apache.solr.cloud.ChaosMonkeySafeLeaderTest-1384726566279/jetty10/index/_0.si (No such file or directory)
             [junit4]   2> 	at java.io.RandomAccessFile.open(Native Method)
             [junit4]   2> 	at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
             [junit4]   2> 	at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:193)
             [junit4]   2> 	at org.apache.lucene.codecs.lucene40.Lucene40SegmentInfoReader.read(Lucene40SegmentInfoReader.java:51)
             [junit4]   2> 	at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:340)
             [junit4]   2> 	at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:404)
             [junit4]   2> 	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:843)
             [junit4]   2> 	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:694)
             [junit4]   2> 	at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:400)
             [junit4]   2> 	at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:741)
             [junit4]   2> 	at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:77)
             [junit4]   2> 	at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64)
             [junit4]   2> 	at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:267)
             [junit4]   2> 	at org.apache.solr.update.DefaultSolrCoreState.openIndexWriter(DefaultSolrCoreState.java:250)
             [junit4]   2> 	at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:458)
             [junit4]   2> 	at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:322)
             [junit4]   2> 	at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:155)
             [junit4]   2> 	at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:435)
             [junit4]   2> 	at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:246)
             [junit4]   2> java.io.FileNotFoundException: /ext3space/jenkins/workspace/ChaosMonkeySafeLeader/solr/build/solr-core/test/J0/org.apache.solr.cloud.ChaosMonkeySafeLeaderTest-1384726566279/jetty10/index/_0.si (No such file or directory)
             [junit4]   2> 	at java.io.RandomAccessFile.open(Native Method)
          
          Show
          Mark Miller added a comment - Have things fairly isolated now. Pretty odd - we skip downloading all these index files to what should be a new tmp dir, but the claim is they already exist - that doesn't seem to hold up because when we try and open on the resulting index (of which we only copied del files and a new segments_n file into) we don't find one of the skipped index files and so it fails back to the previous gen, which is empty. Weird, but should be able to get to the bottom of it. org.apache.lucene.store.MMapDirectory@/ext3space/jenkins/workspace/ChaosMonkeySafeLeader/solr/build/solr-core/test/J0/org.apache.solr.cloud.ChaosMonkeySafeLeaderTest-1384726566279/jetty10/index.80120827121941846 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@73c8693 fullCopy=false [junit4] 2> 76387 T210 C18 P57580 oash.SnapPuller.downloadIndexFiles Skipping download for _2.si because it already exists [junit4] 2> 76387 T210 C18 P57580 oash.SnapPuller.downloadIndexFiles Skipping download for _0.cfe because it already exists [junit4] 2> 76403 T210 C18 P57580 oash.SnapPuller.downloadIndexFiles Skipping download for _2.cfs because it already exists [junit4] 2> 76404 T210 C18 P57580 oash.SnapPuller.downloadIndexFiles Skipping download for _1.cfe because it already exists [junit4] 2> 76406 T210 C18 P57580 oash.SnapPuller.downloadIndexFiles Skipping download for _0.cfs because it already exists [junit4] 2> 76407 T210 C18 P57580 oash.SnapPuller.downloadIndexFiles Skipping download for _1.si because it already exists [junit4] 2> 76411 T210 C18 P57580 oash.SnapPuller.downloadIndexFiles Skipping download for _0.si because it already exists [junit4] 2> 76412 T210 C18 P57580 oash.SnapPuller.downloadIndexFiles Skipping download for _1.cfs because it already exists [junit4] 2> 76419 T210 C18 P57580 oash.SnapPuller.downloadIndexFiles Skipping download for _2_1.del because it already exists [junit4] 2> 76420 T210 C18 P57580 oash.SnapPuller.downloadIndexFiles Skipping download for _2.cfe because it already exists [junit4] 2> 76420 T210 C18 P57580 oash.SnapPuller.fetchLatestIndex Total time taken for download : 0 secs [junit4] 2> 76756 T210 C18 P57580 oash.SnapPuller.moveAFile DEBUG Moving file: _1_1.del [junit4] 2> 76757 T210 C18 P57580 oash.SnapPuller.moveAFile DEBUG Moving file: _0_2.del [junit4] 2> 76758 T210 C18 P57580 oash.SnapPuller.moveAFile DEBUG Moving file: segments_2 [junit4] 2> java.io.FileNotFoundException: /ext3space/jenkins/workspace/ChaosMonkeySafeLeader/solr/build/solr-core/test/J0/org.apache.solr.cloud.ChaosMonkeySafeLeaderTest-1384726566279/jetty10/index/_0.si (No such file or directory) [junit4] 2> at java.io.RandomAccessFile.open(Native Method) [junit4] 2> at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233) [junit4] 2> at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:193) [junit4] 2> at org.apache.lucene.codecs.lucene40.Lucene40SegmentInfoReader.read(Lucene40SegmentInfoReader.java:51) [junit4] 2> at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:340) [junit4] 2> at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:404) [junit4] 2> at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:843) [junit4] 2> at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:694) [junit4] 2> at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:400) [junit4] 2> at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:741) [junit4] 2> at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:77) [junit4] 2> at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64) [junit4] 2> at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:267) [junit4] 2> at org.apache.solr.update.DefaultSolrCoreState.openIndexWriter(DefaultSolrCoreState.java:250) [junit4] 2> at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:458) [junit4] 2> at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:322) [junit4] 2> at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:155) [junit4] 2> at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:435) [junit4] 2> at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:246) [junit4] 2> java.io.FileNotFoundException: /ext3space/jenkins/workspace/ChaosMonkeySafeLeader/solr/build/solr-core/test/J0/org.apache.solr.cloud.ChaosMonkeySafeLeaderTest-1384726566279/jetty10/index/_0.si (No such file or directory) [junit4] 2> at java.io.RandomAccessFile.open(Native Method)
          Hide
          Mark Miller added a comment -

          I think I've got it - just have to test it out. We actually roll back the writer on the main index before we copy the files from the tmp index into - but that rollback can actually end up removing files - so we might have skipped some files thinking they were already in the std index (because they were), but then the rollback can remove them. This causes a corrupt index and opening a reader on it rolls back to the first empty commit.

          This has been around for some time, but I think recent changes allowed it to be exposed in tests.

          Show
          Mark Miller added a comment - I think I've got it - just have to test it out. We actually roll back the writer on the main index before we copy the files from the tmp index into - but that rollback can actually end up removing files - so we might have skipped some files thinking they were already in the std index (because they were), but then the rollback can remove them. This causes a corrupt index and opening a reader on it rolls back to the first empty commit. This has been around for some time, but I think recent changes allowed it to be exposed in tests.
          Hide
          Mark Miller added a comment -

          Here is a patch for an idea I'm trying out.

          Show
          Mark Miller added a comment - Here is a patch for an idea I'm trying out.
          Hide
          Mark Miller added a comment -

          This is testing very nicely - I'll let it run while I'm gone and commit it when I get home.

          Show
          Mark Miller added a comment - This is testing very nicely - I'll let it run while I'm gone and commit it when I get home.
          Hide
          ASF subversion and git services added a comment -

          Commit 1542884 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1542884 ]

          SOLR-5397: Replication can fail silently in some cases.

          Show
          ASF subversion and git services added a comment - Commit 1542884 from Mark Miller in branch 'dev/trunk' [ https://svn.apache.org/r1542884 ] SOLR-5397 : Replication can fail silently in some cases.
          Hide
          ASF subversion and git services added a comment -

          Commit 1542885 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1542885 ]

          SOLR-5397: Replication can fail silently in some cases.

          Show
          ASF subversion and git services added a comment - Commit 1542885 from Mark Miller in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1542885 ] SOLR-5397 : Replication can fail silently in some cases.
          Hide
          ASF subversion and git services added a comment -

          Commit 1542890 from Mark Miller in branch 'dev/branches/lucene_solr_4_6'
          [ https://svn.apache.org/r1542890 ]

          SOLR-5397: Replication can fail silently in some cases.

          Show
          ASF subversion and git services added a comment - Commit 1542890 from Mark Miller in branch 'dev/branches/lucene_solr_4_6' [ https://svn.apache.org/r1542890 ] SOLR-5397 : Replication can fail silently in some cases.

            People

            • Assignee:
              Mark Miller
              Reporter:
              Mark Miller
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development