Solr / SOLR-9836

Add more graceful recovery steps when failing to create SolrCore

    Details

    • Type: Bug
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 7.0, 6.7
    • Component/s: SolrCloud
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels: None

      Description

      I have seen several cases where there is a zero-length segments_n file. We haven't identified the root cause of these issues (possibly a poorly timed crash during replication?), but if there is another node available, Solr should be able to recover from this situation. Currently, we log and give up on loading that core, leaving the user to manually intervene.

      Attachments

      1. SOLR-9836.patch
        27 kB
        Mike Drob
      2. SOLR-9836.patch
        27 kB
        Mike Drob
      3. SOLR-9836.patch
        27 kB
        Mike Drob
      4. SOLR-9836.patch
        26 kB
        Mike Drob
      5. SOLR-9836.patch
        26 kB
        Mike Drob
      6. SOLR-9836.patch
        29 kB
        Mike Drob
      7. SOLR-9836.patch
        16 kB
        Mike Drob

        Issue Links

          Activity

          Mike Drob added a comment -

          Attaching a first attempt at improving this behaviour.

          Some open discussion points -

          • Do we want to expand this for other types of failures? I think yes, but in a future iteration/JIRA.
          • Would it be safe to add directoryFactory.doneWithDirectory() to modifyIndexProps?
          • Should modifyIndexProps stay in IndexFetcher or move somewhere more generic?
          Mark Miller added a comment -

          We probably want to be able to turn it off. Some users may want the ability to use check index and try to salvage what they can in corruption cases.

          I'm not sure that is the right exception to catch - very brittle. We should probably be looking mostly for CorruptIndexException, and if that doesn't cover a case at the Lucene level, look at improving that there. Even if the case of a 0-byte segments file with nothing to roll back on throws an EOFException today, it may not tomorrow. I think that is the goal of CorruptIndexException - you can actually have a little more than momentary confidence that your code is not treating exceptions one way while things change underneath you over time.

          Would it be safe to add directoryFactory.doneWithDirectory() to modifyIndexProps?

          directoryFactory.doneWithDirectory is for the case where you are done with the directory and it can now be deleted if need be - you won't access it again.

          Should modifyIndexProps stay in IndexFetcher or move somewhere more generic?

          I have not looked yet, but it may make more sense in SolrCore or something.

          Mark Miller added a comment -

          I think initially it would be good to offer three options.

          • No action
          • Recover from the leader on a CorruptIndexException
          • Only recover from the leader if the segments file is not readable and there is none to fall back to
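A minimal sketch of how such a three-way toggle might look as a system property plus an enum; the property name, class name, and enum values here are hypothetical placeholders, not what the eventual patch uses.

```java
import java.util.Locale;

// Hypothetical sketch of the proposed toggle. The property name and enum
// values are placeholders; the real patch may name these differently.
public enum CoreInitFailedAction {
  NO_ACTION,                      // leave the core down for manual intervention
  RECOVER_ON_CORRUPTION,          // recover from leader on any CorruptIndexException
  RECOVER_IF_SEGMENTS_UNREADABLE; // recover only when no segments file can be read

  /** Read the configured action from a system property, defaulting to NO_ACTION. */
  public static CoreInitFailedAction fromProperty(String propertyName) {
    String value = System.getProperty(propertyName, NO_ACTION.name());
    try {
      return valueOf(value.toUpperCase(Locale.ROOT));
    } catch (IllegalArgumentException e) {
      return NO_ACTION; // unrecognized value: fail safe and take no action
    }
  }
}
```

Defaulting to NO_ACTION keeps the current behaviour (log and give up) unless the user opts in.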
          Mark Miller added a comment -

          directoryFactory.doneWithDirectory is for the case where you are done with the directory and it can now be deleted if need be

          I should not have said deleted; it was just on my mind due to the ephemeral directory factories, but you can expect data to survive if you have not specified options otherwise. It's for one-off directories, though, or directories you are moving on from - once their reference counts hit 0, they can be let out of the cache.

          Mike Drob added a comment -

          I'm not sure that is the right exception to catch - very brittle. We should probably be looking mostly for CorruptIndexException, and if that doesn't cover a case at the Lucene level, look at improving that there. Even if the case of a 0-byte segments file with nothing to roll back on throws an EOFException today, it may not tomorrow. I think that is the goal of CorruptIndexException - you can actually have a little more than momentary confidence that your code is not treating exceptions one way while things change underneath you over time.

          I could add a check somewhere along the chain that would turn an EOFException into a CorruptIndexException. However, I'm not confident enough in the Lucene internals to know if this leads to eventual false positives somewhere... It probably looks like:

          SegmentInfos.java:276
               long generation = generationFromSegmentsFileName(segmentFileName);
               //System.out.println(Thread.currentThread() + ": SegmentInfos.readCommit " + segmentFileName);
          +    ChecksumIndexInput saved = null;
               try (ChecksumIndexInput input = directory.openChecksumInput(segmentFileName, IOContext.READ)) {
          +      saved = input;
                 return readCommit(directory, input, generation);
          +    } catch (EOFException e) {
          +      throw new CorruptIndexException("Unexpected end of file while reading index.", saved, e);
               }
             }
          

          But the method javadoc worries me: "Read a particular segmentFileName. Note that this may throw an IOException if a commit is in process."
          Under what circumstances would this throw an IOException? Randomly returning CorruptIndexException during normal operation is bad news.

          Mike Drob added a comment -

          Filed LUCENE-7592 to deal with the exception throwing issue.

          Mike Drob added a comment -

          Current WIP patch.

          • Moved modifyIndexProps to SolrCore
          • Added a system property toggle for controlling the desired behaviour here.
            • Property name and values are shots in the dark and by no means final
            • Used an enum because it made sense logically at the time, not sure if this actually matters.
          • Switched to looking for CorruptIndexException
          • Falling back to an earlier segments file is not implemented yet, pending some questions below (there's a unit test, though).
            • It's very hard to tell if it was actually the segments file that is corrupt, or if it was something else.
            • Is it sufficient to delete segments_n and let Lucene try to read from the new "latest" commit? Will this screw up replication? Do we need to update the generation anywhere else? And I'm still nervous about indiscriminately deleting files where recovery might be possible. I guess that's the point of the config options.
            • Another option is to hack a FilterDirectory on the index that would hide the latest segments_n file instead of deleting it. That might work to open it, but we will likely end up with write conflicts the next time we commit.

          The more I toss this idea around, the more it feels like something that would be more cleanly handled at the Lucene level. Possibly best to have two options (recover from leader, do nothing) instead of the initial three proposed by Mark Miller and expand on them later.

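The "hide the latest segments_n" idea can be reduced to its core: given a directory listing, hide the file with the highest generation. The sketch below works over plain file-name lists rather than a real org.apache.lucene.store.FilterDirectory, and the base-36 generation parsing mirrors what Lucene does for segments_N file names; treat it as an illustration, not the patch.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch only: identifies and hides the highest-generation
// segments_N file from a listing, the way a FilterDirectory wrapper might.
public class SegmentsFileFilter {

  /** Parse the generation from a segments_N file name (base 36), or -1 if not one. */
  static long generationOf(String fileName) {
    if (!fileName.startsWith("segments_")) {
      return -1;
    }
    try {
      return Long.parseLong(fileName.substring("segments_".length()), Character.MAX_RADIX);
    } catch (NumberFormatException e) {
      return -1;
    }
  }

  /** Return the listing with the highest-generation segments_N file hidden. */
  static List<String> hideLatestSegmentsFile(List<String> listing) {
    long latest = listing.stream().mapToLong(SegmentsFileFilter::generationOf).max().orElse(-1);
    if (latest < 0) {
      return listing; // no segments file present; nothing to hide
    }
    return listing.stream()
        .filter(f -> generationOf(f) != latest)
        .collect(Collectors.toList());
  }
}
```

As the comment above notes, hiding the file only helps at open time; the next commit would still conflict with the hidden generation.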
          Mark Miller added a comment -

          Falling back to an earlier segments file is not implemented yet

          This should already be Lucene's behavior. I assume if it's not falling back it's because there is no previous segments file to fall back to.

          Mark Miller added a comment -

          Possibly best to have two options

          The third option is not very difficult. Lucene already loads the last segments file it can. So if we get a corrupt index, we can just sanity check that the segments file can be loaded. If it can't, we can't fix things anyway, so recover. If the segments file looks fine, don't recover because the index could be corrected.

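The branching described above amounts to a small decision table. A sketch with hypothetical names (Mode, shouldRecoverFromLeader, segmentsFileLoadable) that are not from the patch:

```java
// Hypothetical decision function for the three proposed options; names are
// placeholders, and segmentsFileLoadable stands in for a real sanity check
// (e.g. attempting to read the latest commit).
public class RecoveryDecision {
  enum Mode { NO_ACTION, RECOVER_ON_CORRUPTION, RECOVER_IF_SEGMENTS_UNREADABLE }

  /**
   * Decide whether to recover from the leader after a corruption failure
   * while creating the SolrCore.
   */
  static boolean shouldRecoverFromLeader(Mode mode, boolean indexCorrupt, boolean segmentsFileLoadable) {
    if (!indexCorrupt) {
      return false; // the core opened fine; nothing to do
    }
    switch (mode) {
      case NO_ACTION:
        return false; // leave the index in place for manual salvage
      case RECOVER_ON_CORRUPTION:
        return true;  // any corruption triggers recovery from the leader
      case RECOVER_IF_SEGMENTS_UNREADABLE:
        // if a segments file still loads, the index may be correctable locally
        return !segmentsFileLoadable;
      default:
        return false;
    }
  }
}
```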
          Mike Drob added a comment -

          This should already be Lucene's behavior. I assume if it's not falling back it's because there is no previous segments file to fall back to.

          I didn't see Lucene doing this. Or at least, I didn't see Solr leverage Lucene to do this, both through manual inspection of the code and through testing via MissingSegmentRecoveryTest::testRollback in my patch.

          Mark Miller added a comment -

          I believe it's in SegmentInfos->FindSegmentsFile. We can leave the third option for another JIRA though.

          Mike Drob added a comment -

          I believe it's in SegmentInfos->FindSegmentsFile

          That only goes forward in the case of concurrent commits; I don't see it ever falling back to an older segments file.

          Changes in latest patch:

          • Completely drop attempts to open older segments. Leaving it for future work.
          • Added javadocs.
          • Preserve the original exception in case there is still a problem the second time we create the SolrCore.

          MissingSegmentRecoveryTest takes ~45 seconds to run on my machine. Is this long enough that it deserves a @Slow annotation?

          Mark Miller added a comment -

          That only goes forward in the case of concurrent commits; I don't see it ever falling back to an older segments file.

          That would be very surprising to me. A lot of code has changed in this area since I last looked at it closely, but if Lucene crashed while writing the segments file and then couldn't load the index, rather than falling back to the last successful commit, I'd be really surprised. I'm not current on the strategy for that situation, though.

          Mark Miller added a comment -

          Hmm, okay, it looks like these days we count on a file rename to publish the segments file, so that case is avoided altogether. We can probably still see it, though, because we have replication and external things like that.

          Mark Miller added a comment -

          Is this long enough that it deserves a @Slow annotation?

          Yes.

          Have you run the full test suite as well yet?

          Mike Drob added a comment -

          Have you run the full test suite as well yet?

          I'm getting some errors related to backup/restore operations that go away if I modify those tests to useFactory("solr.StandardDirectoryFactory") instead of RAM/Mock. Will look into this further and then post a new patch when I figure it out.

          Mike Drob added a comment -

          version 4:

          • Rebased patch onto master, incorporating changes from SOLR-9859.
          • Addressed failing tests.
          Mike Drob added a comment -

          Patch #5 - rebased onto the changes from the Solr metrics work.

          Mark Miller added a comment -

          writeNewIndexProps no longer has the correct implementation - it needs to be properly merged up.

          Mike Drob added a comment -

          Yep, you're right, the delete snuck in. Thanks for looking.

          Mike Drob added a comment -

          I'm running some tests, but getting failures from, I think, SOLR-9928. I'm also seeing some failures in CoreAdminRequestStatusTest, but I get those without my patch applied as well. Will try to track all of these down.

          Mike Drob added a comment -

          Never mind, the CoreAdminRequestStatusTest failures I saw were due to an environment issue on my end. I think everything is passing for me now.

          Mike Drob added a comment -

          Attaching the patch rebased onto latest master, since the current one wouldn't apply cleanly anymore.

          Mark Miller added a comment -

          Thanks Mike, I'm getting this in today.

          Mark Miller added a comment -

          Okay, I'm pretty happy with this patch. I'll commit tomorrow to give anyone else with an interest a chance to weigh in.

          ASF subversion and git services added a comment -

          Commit a89560bb72de57d291db45c52c04b9edf6c91d92 in lucene-solr's branch refs/heads/master from markrmiller
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a89560b ]

          SOLR-9836: Add ability to recover from leader when index corruption is detected on SolrCore creation.

          ASF subversion and git services added a comment -

          Commit 59d7bc5ede7cf4d50b5efb79b31bc0343d6f10dc in lucene-solr's branch refs/heads/branch_6x from markrmiller
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=59d7bc5 ]

          SOLR-9836: Add ability to recover from leader when index corruption is detected on SolrCore creation.

          Conflicts:
          	solr/CHANGES.txt
          	solr/core/src/java/org/apache/solr/core/CoreContainer.java
          Hoss Man added a comment -

          FWIW: Erick found a different situation where corruption prevented Solr from being able to do a full index fetch to recover, which is evidently not solved by this existing fix: SOLR-10006

          Steve Rowe added a comment - edited

          MissingSegmentRecoveryTest.testLeaderRecovery() has been failing pretty regularly on Jenkins. Something happened on or about February 10th, when the probability of failure went up considerably (and has since remained at this elevated level).

          I got 3 failures beasting 100 iterations of the test suite using Mark Miller's beasting script on my box. However, for the past three weeks I've gotten several failures a day on my Jenkins, and roughly one a day on either ASF or Policeman Jenkins.

          Here's a recent failure from https://builds.apache.org/job/Lucene-Solr-Tests-master/1699/:

            [junit4]   2> 599977 ERROR (coreLoadExecutor-3254-thread-1-processing-n:127.0.0.1:41308_solr) [n:127.0.0.1:41308_solr c:MissingSegmentRecoveryTest s:shard1 r:core_node1 x:MissingSegmentRecoveryTest_shard1_replica2] o.a.s.u.SolrIndexWriter Error closing IndexWriter
            [junit4]   2> java.nio.file.NoSuchFileException: /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-Tests-master/solr/build/solr-core/test/J2/temp/solr.cloud.MissingSegmentRecoveryTest_B800C15EC6F11C02-001/tempDir-001/node2/MissingSegmentRecoveryTest_shard1_replica2/data/index.20170228030909468/write.lock
            [junit4]   2> 	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
            [junit4]   2> 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
            [junit4]   2> 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
            [junit4]   2> 	at sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
            [junit4]   2> 	at sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
            [junit4]   2> 	at sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
            [junit4]   2> 	at java.nio.file.Files.readAttributes(Files.java:1737)
            [junit4]   2> 	at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:177)
            [junit4]   2> 	at org.apache.lucene.store.LockValidatingDirectoryWrapper.sync(LockValidatingDirectoryWrapper.java:67)
            [junit4]   2> 	at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4698)
            [junit4]   2> 	at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3093)
            [junit4]   2> 	at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3227)
            [junit4]   2> 	at org.apache.lucene.index.IndexWriter.shutdown(IndexWriter.java:1136)
            [junit4]   2> 	at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1179)
            [junit4]   2> 	at org.apache.solr.update.SolrIndexWriter.close(SolrIndexWriter.java:291)
            [junit4]   2> 	at org.apache.solr.core.SolrCore.initIndex(SolrCore.java:728)
            [junit4]   2> 	at org.apache.solr.core.SolrCore.<init>(SolrCore.java:911)
            [junit4]   2> 	at org.apache.solr.core.SolrCore.<init>(SolrCore.java:828)
            [junit4]   2> 	at org.apache.solr.core.CoreContainer.processCoreCreateException(CoreContainer.java:1011)
            [junit4]   2> 	at org.apache.solr.core.CoreContainer.create(CoreContainer.java:939)
            [junit4]   2> 	at org.apache.solr.core.CoreContainer.lambda$load$3(CoreContainer.java:572)
            [junit4]   2> 	at com.codahale.metrics.InstrumentedExecutorService$InstrumentedCallable.call(InstrumentedExecutorService.java:197)
            [junit4]   2> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            [junit4]   2> 	at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
            [junit4]   2> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            [junit4]   2> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          [...]
            [junit4]   2> 600005 ERROR (coreContainerWorkExecutor-3250-thread-1-processing-n:127.0.0.1:41308_solr) [n:127.0.0.1:41308_solr    ] o.a.s.c.CoreContainer Error waiting for SolrCore to be created
            [junit4]   2> java.util.concurrent.ExecutionException: org.apache.solr.common.SolrException: Unable to create core [MissingSegmentRecoveryTest_shard1_replica2]
            [junit4]   2> 	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
            [junit4]   2> 	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
            [junit4]   2> 	at org.apache.solr.core.CoreContainer.lambda$load$4(CoreContainer.java:600)
            [junit4]   2> 	at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
            [junit4]   2> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            [junit4]   2> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            [junit4]   2> 	at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
            [junit4]   2> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            [junit4]   2> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            [junit4]   2> 	at java.lang.Thread.run(Thread.java:745)
            [junit4]   2> Caused by: org.apache.solr.common.SolrException: Unable to create core [MissingSegmentRecoveryTest_shard1_replica2]
            [junit4]   2> 	at org.apache.solr.core.CoreContainer.create(CoreContainer.java:952)
            [junit4]   2> 	at org.apache.solr.core.CoreContainer.lambda$load$3(CoreContainer.java:572)
            [junit4]   2> 	at com.codahale.metrics.InstrumentedExecutorService$InstrumentedCallable.call(InstrumentedExecutorService.java:197)
            [junit4]   2> 	... 5 more
            [junit4]   2> Caused by: org.apache.solr.common.SolrException: Error opening new searcher
            [junit4]   2> 	at org.apache.solr.core.SolrCore.<init>(SolrCore.java:964)
            [junit4]   2> 	at org.apache.solr.core.SolrCore.<init>(SolrCore.java:828)
            [junit4]   2> 	at org.apache.solr.core.CoreContainer.processCoreCreateException(CoreContainer.java:1011)
            [junit4]   2> 	at org.apache.solr.core.CoreContainer.create(CoreContainer.java:939)
            [junit4]   2> 	... 7 more
            [junit4]   2> 	Suppressed: org.apache.solr.common.SolrException: Error opening new searcher
            [junit4]   2> 		at org.apache.solr.core.SolrCore.<init>(SolrCore.java:964)
            [junit4]   2> 		at org.apache.solr.core.SolrCore.<init>(SolrCore.java:828)
            [junit4]   2> 		at org.apache.solr.core.CoreContainer.create(CoreContainer.java:937)
            [junit4]   2> 		... 7 more
            [junit4]   2> 	Caused by: org.apache.solr.common.SolrException: Error opening new searcher
            [junit4]   2> 		at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2005)
            [junit4]   2> 		at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2125)
            [junit4]   2> 		at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1053)
            [junit4]   2> 		at org.apache.solr.core.SolrCore.<init>(SolrCore.java:937)
            [junit4]   2> 		... 9 more
            [junit4]   2> 	Caused by: org.apache.lucene.index.CorruptIndexException: Unexpected file read error while reading index. (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/x1/jenkins/jenkins-slave/workspace/Lucene-Solr-Tests-master/solr/build/solr-core/test/J2/temp/solr.cloud.MissingSegmentRecoveryTest_B800C15EC6F11C02-001/tempDir-001/node2/MissingSegmentRecoveryTest_shard1_replica2/data/index/segments_2")))
            [junit4]   2> 		at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:286)
            [junit4]   2> 		at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:938)
            [junit4]   2> 		at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:125)
            [junit4]   2> 		at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:100)
            [junit4]   2> 		at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:240)
            [junit4]   2> 		at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:114)
            [junit4]   2> 		at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1966)
            [junit4]   2> 		... 12 more
            [junit4]   2> 	Caused by: java.io.EOFException: read past EOF: MMapIndexInput(path="/x1/jenkins/jenkins-slave/workspace/Lucene-Solr-Tests-master/solr/build/solr-core/test/J2/temp/solr.cloud.MissingSegmentRecoveryTest_B800C15EC6F11C02-001/tempDir-001/node2/MissingSegmentRecoveryTest_shard1_replica2/data/index/segments_2")
            [junit4]   2> 		at org.apache.lucene.store.ByteBufferIndexInput.readByte(ByteBufferIndexInput.java:75)
            [junit4]   2> 		at org.apache.lucene.store.BufferedChecksumIndexInput.readByte(BufferedChecksumIndexInput.java:41)
            [junit4]   2> 		at org.apache.lucene.store.DataInput.readInt(DataInput.java:101)
            [junit4]   2> 		at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:296)
            [junit4]   2> 		at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:284)
            [junit4]   2> 		... 18 more
            [junit4]   2> Caused by: org.apache.solr.common.SolrException: Error opening new searcher
            [junit4]   2> 	at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2005)
            [junit4]   2> 	at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2125)
            [junit4]   2> 	at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1053)
            [junit4]   2> 	at org.apache.solr.core.SolrCore.<init>(SolrCore.java:937)
            [junit4]   2> 	... 10 more
            [junit4]   2> Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in LockValidatingDirectoryWrapper(MMapDirectory@/x1/jenkins/jenkins-slave/workspace/Lucene-Solr-Tests-master/solr/build/solr-core/test/J2/temp/solr.cloud.MissingSegmentRecoveryTest_B800C15EC6F11C02-001/tempDir-001/node2/MissingSegmentRecoveryTest_shard1_replica2/data/index.20170228030909468 lockFactory=org.apache.lucene.store.NativeFSLockFactory@74782755): files: [write.lock]
            [junit4]   2> 	at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:933)
            [junit4]   2> 	at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:125)
            [junit4]   2> 	at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:100)
            [junit4]   2> 	at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:240)
            [junit4]   2> 	at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:114)
            [junit4]   2> 	at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1966)
          [...]
            [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=MissingSegmentRecoveryTest -Dtests.method=testLeaderRecovery -Dtests.seed=B800C15EC6F11C02 -Dtests.multiplier=2 -Dtests.slow=true -Dtests.locale=fi-FI -Dtests.timezone=Asia/Famagusta -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
            [junit4] FAILURE 94.6s J2 | MissingSegmentRecoveryTest.testLeaderRecovery <<<
            [junit4]    > Throwable #1: java.lang.AssertionError: Expected a collection with one shard and two replicas
            [junit4]    > null
            [junit4]    > Last available state: DocCollection(MissingSegmentRecoveryTest//collections/MissingSegmentRecoveryTest/state.json/6)={
            [junit4]    >   "replicationFactor":"2",
            [junit4]    >   "shards":{"shard1":{
            [junit4]    >       "range":"80000000-7fffffff",
            [junit4]    >       "state":"active",
            [junit4]    >       "replicas":{
            [junit4]    >         "core_node1":{
            [junit4]    >           "core":"MissingSegmentRecoveryTest_shard1_replica2",
            [junit4]    >           "base_url":"https://127.0.0.1:41308/solr",
            [junit4]    >           "node_name":"127.0.0.1:41308_solr",
            [junit4]    >           "state":"down"},
            [junit4]    >         "core_node2":{
            [junit4]    >           "core":"MissingSegmentRecoveryTest_shard1_replica1",
            [junit4]    >           "base_url":"https://127.0.0.1:60247/solr",
            [junit4]    >           "node_name":"127.0.0.1:60247_solr",
            [junit4]    >           "state":"active",
            [junit4]    >           "leader":"true"}}}},
            [junit4]    >   "router":{"name":"compositeId"},
            [junit4]    >   "maxShardsPerNode":"1",
            [junit4]    >   "autoAddReplicas":"false"}
            [junit4]    > 	at __randomizedtesting.SeedInfo.seed([B800C15EC6F11C02:E855595D9FD0AA1F]:0)
            [junit4]    > 	at org.apache.solr.cloud.SolrCloudTestCase.waitForState(SolrCloudTestCase.java:265)
            [junit4]    > 	at org.apache.solr.cloud.MissingSegmentRecoveryTest.testLeaderRecovery(MissingSegmentRecoveryTest.java:105)
          [...]
            [junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): {_version_=TestBloomFilteredLucenePostings(BloomFilteringPostingsFormat(Lucene50(blocksize=128))), id=FST50}, docValues:{}, maxPointsInLeafNode=1106, maxMBSortInHeap=6.191537660994534, sim=RandomSimilarity(queryNorm=true): {}, locale=fi-FI, timezone=Asia/Famagusta
            [junit4]   2> NOTE: Linux 3.13.0-85-generic amd64/Oracle Corporation 1.8.0_121 (64-bit)/cpus=4,threads=1,free=138683768,total=527433728
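The root failure in the log above is an EOFException while reading a truncated segments_2 file, which then cascades into IndexNotFoundException once the file is gone. A minimal, Lucene-free sketch (all names hypothetical, not the actual patch code) of the kind of pre-flight check a recovery path could perform: scan the index directory for zero-length segments_N files before attempting to open an IndexWriter:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class SegmentsFileCheck {
    // Returns the segments_N files in indexDir that are zero-length,
    // i.e. the ones that would fail with "read past EOF" on open.
    static List<Path> findTruncatedSegmentsFiles(Path indexDir) throws IOException {
        List<Path> truncated = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(indexDir, "segments_*")) {
            for (Path p : stream) {
                if (Files.size(p) == 0) {
                    truncated.add(p);
                }
            }
        }
        return truncated;
    }

    public static void main(String[] args) throws IOException {
        // Simulate the corrupt state seen in the test: an empty commit file.
        Path dir = Files.createTempDirectory("index");
        Files.write(dir.resolve("segments_2"), new byte[0]);      // zero-length commit point
        Files.write(dir.resolve("_0.cfs"), new byte[] {1, 2, 3}); // unrelated index file
        List<Path> bad = findTruncatedSegmentsFiles(dir);
        System.out.println(bad.size()); // prints 1
    }
}
```

If such a check fires, the core could be flagged for full replication from the leader instead of failing core creation outright, which is the behavior this issue proposes.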
          
          Show
          steve_rowe Steve Rowe added a comment - - edited MissingSegmentRecoveryTest.testLeaderRecovery() has been failing pretty regularly on Jenkins. Something happened on or about February 10th, when the probability of failure went up considerably (and has since remained at this elevated level). I got 3 failures beasting 100 iterations of the test suite using Miller's beasting script on my box. However, for the past three weeks I've gotten several failures a day on my Jenkins, and roughly once a day on either ASF or Policeman Jenkins. Here's a recent failure: https://builds.apache.org/job/Lucene-Solr-Tests-master/1699/
          Hide
          steve_rowe Steve Rowe added a comment -

          Non-reproducing master failure from my Jenkins yesterday:

          Checking out Revision 97ca529e49505cef0c1dd6138ed70be4a7b85610 (refs/remotes/origin/master)
          [...]
             [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=MissingSegmentRecoveryTest -Dtests.method=testLeaderRecovery -Dtests.seed=E0C710C4147CEA7B -Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true -Dtests.linedocsfile=/home/jenkins/lucene-data/enwiki.random.lines.txt -Dtests.locale=ar-BH -Dtests.timezone=Asia/Urumqi -Dtests.asserts=true -Dtests.file.encoding=UTF-8
             [junit4] FAILURE 95.9s J2 | MissingSegmentRecoveryTest.testLeaderRecovery <<<
             [junit4]    > Throwable #1: java.lang.AssertionError: Expected a collection with one shard and two replicas
             [junit4]    > null
             [junit4]    > Live Nodes: [127.0.0.1:42849_solr, 127.0.0.1:43941_solr]
             [junit4]    > Last available state: DocCollection(MissingSegmentRecoveryTest//collections/MissingSegmentRecoveryTest/state.json/9)={
             [junit4]    >   "pullReplicas":"0",
             [junit4]    >   "replicationFactor":"2",
             [junit4]    >   "shards":{"shard1":{
             [junit4]    >       "range":"80000000-7fffffff",
             [junit4]    >       "state":"active",
             [junit4]    >       "replicas":{
             [junit4]    >         "core_node1":{
             [junit4]    >           "core":"MissingSegmentRecoveryTest_shard1_replica_n1",
             [junit4]    >           "base_url":"https://127.0.0.1:42849/solr",
             [junit4]    >           "node_name":"127.0.0.1:42849_solr",
             [junit4]    >           "state":"active",
             [junit4]    >           "type":"NRT",
             [junit4]    >           "leader":"true"},
             [junit4]    >         "core_node2":{
             [junit4]    >           "core":"MissingSegmentRecoveryTest_shard1_replica_n2",
             [junit4]    >           "base_url":"https://127.0.0.1:43941/solr",
             [junit4]    >           "node_name":"127.0.0.1:43941_solr",
             [junit4]    >           "state":"down",
             [junit4]    >           "type":"NRT"}}}},
             [junit4]    >   "router":{"name":"compositeId"},
             [junit4]    >   "maxShardsPerNode":"1",
             [junit4]    >   "autoAddReplicas":"false",
             [junit4]    >   "nrtReplicas":"2",
             [junit4]    >   "tlogReplicas":"0"}
             [junit4]    > 	at __randomizedtesting.SeedInfo.seed([E0C710C4147CEA7B:B09288C74D5D5C66]:0)
             [junit4]    > 	at org.apache.solr.cloud.SolrCloudTestCase.waitForState(SolrCloudTestCase.java:269)
             [junit4]    > 	at org.apache.solr.cloud.MissingSegmentRecoveryTest.testLeaderRecovery(MissingSegmentRecoveryTest.java:105)
          [...]
             [junit4]   2> NOTE: test params are: codec=HighCompressionCompressingStoredFields(storedFieldsFormat=CompressingStoredFieldsFormat(compressionMode=HIGH_COMPRESSION, chunkSize=4, maxDocsPerChunk=1, blockSize=790), termVectorsFormat=CompressingTermVectorsFormat(compressionMode=HIGH_COMPRESSION, chunkSize=4, blockSize=790)), sim=RandomSimilarity(queryNorm=true): {}, locale=ar-BH, timezone=Asia/Urumqi
             [junit4]   2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation 1.8.0_77 (64-bit)/cpus=16,threads=1,free=300978976,total=530055168
             [junit4]   2> NOTE: All tests run in this JVM: [SolrCloudReportersTest, TestConfigSetsAPIExclusivity, TestCloudJSONFacetJoinDomain, RequestHandlersTest, TestRangeQuery, TestJsonFacetRefinement, ZkCLITest, ExternalFileFieldSortTest, LukeRequestHandlerTest, SimpleMLTQParserTest, AutoScalingHandlerTest, CdcrBootstrapTest, TestBulkSchemaConcurrent, CoreAdminHandlerTest, SuggestComponentTest, TestRuleBasedAuthorizationPlugin, CdcrUpdateLogTest, SpellCheckCollatorWithCollapseTest, SortByFunctionTest, MissingSegmentRecoveryTest]
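The assertion that fails here comes from SolrCloudTestCase.waitForState, which polls the collection state until a predicate holds or a timeout expires. A generic, dependency-free sketch of that polling pattern (names and intervals hypothetical, not SolrCloudTestCase's actual implementation):

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class WaitFor {
    // Polls the condition until it returns true or the timeout elapses.
    // Returns whether the condition was met in time.
    static boolean waitFor(Supplier<Boolean> condition, long timeout, TimeUnit unit)
            throws InterruptedException {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (System.nanoTime() < deadline) {
            if (condition.get()) return true;
            Thread.sleep(100); // poll interval between state checks
        }
        return condition.get(); // one final check at the deadline
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // Condition becomes true after ~300 ms, well inside the 2 s timeout.
        boolean ok = waitFor(() -> System.currentTimeMillis() - start > 300,
                             2, TimeUnit.SECONDS);
        System.out.println(ok); // prints true
    }
}
```

In the failing runs above, the replica never leaves the "down" state, so the predicate never becomes true and the timeout produces the AssertionError.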
          
          Show
          steve_rowe Steve Rowe added a comment - Non-reproducing master failure from my Jenkins yesterday: Checking out Revision 97ca529e49505cef0c1dd6138ed70be4a7b85610 (refs/remotes/origin/master) [...] [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=MissingSegmentRecoveryTest -Dtests.method=testLeaderRecovery -Dtests.seed=E0C710C4147CEA7B -Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true -Dtests.linedocsfile=/home/jenkins/lucene-data/enwiki.random.lines.txt -Dtests.locale=ar-BH -Dtests.timezone=Asia/Urumqi -Dtests.asserts=true -Dtests.file.encoding=UTF-8 [junit4] FAILURE 95.9s J2 | MissingSegmentRecoveryTest.testLeaderRecovery <<< [junit4] > Throwable #1: java.lang.AssertionError: Expected a collection with one shard and two replicas [junit4] > null [junit4] > Live Nodes: [127.0.0.1:42849_solr, 127.0.0.1:43941_solr] [junit4] > Last available state: DocCollection(MissingSegmentRecoveryTest//collections/MissingSegmentRecoveryTest/state.json/9)={ [junit4] > "pullReplicas":"0", [junit4] > "replicationFactor":"2", [junit4] > "shards":{"shard1":{ [junit4] > "range":"80000000-7fffffff", [junit4] > "state":"active", [junit4] > "replicas":{ [junit4] > "core_node1":{ [junit4] > "core":"MissingSegmentRecoveryTest_shard1_replica_n1", [junit4] > "base_url":"https://127.0.0.1:42849/solr", [junit4] > "node_name":"127.0.0.1:42849_solr", [junit4] > "state":"active", [junit4] > "type":"NRT", [junit4] > "leader":"true"}, [junit4] > "core_node2":{ [junit4] > "core":"MissingSegmentRecoveryTest_shard1_replica_n2", [junit4] > "base_url":"https://127.0.0.1:43941/solr", [junit4] > "node_name":"127.0.0.1:43941_solr", [junit4] > "state":"down", [junit4] > "type":"NRT"}}}}, [junit4] > "router":{"name":"compositeId"}, [junit4] > "maxShardsPerNode":"1", [junit4] > "autoAddReplicas":"false", [junit4] > "nrtReplicas":"2", [junit4] > "tlogReplicas":"0"} [junit4] > at __randomizedtesting.SeedInfo.seed([E0C710C4147CEA7B:B09288C74D5D5C66]:0) [junit4] > at 
org.apache.solr.cloud.SolrCloudTestCase.waitForState(SolrCloudTestCase.java:269) [junit4] > at org.apache.solr.cloud.MissingSegmentRecoveryTest.testLeaderRecovery(MissingSegmentRecoveryTest.java:105) [...] [junit4] 2> NOTE: test params are: codec=HighCompressionCompressingStoredFields(storedFieldsFormat=CompressingStoredFieldsFormat(compressionMode=HIGH_COMPRESSION, chunkSize=4, maxDocsPerChunk=1, blockSize=790), termVectorsFormat=CompressingTermVectorsFormat(compressionMode=HIGH_COMPRESSION, chunkSize=4, blockSize=790)), sim=RandomSimilarity(queryNorm=true): {}, locale=ar-BH, timezone=Asia/Urumqi [junit4] 2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation 1.8.0_77 (64-bit)/cpus=16,threads=1,free=300978976,total=530055168 [junit4] 2> NOTE: All tests run in this JVM: [SolrCloudReportersTest, TestConfigSetsAPIExclusivity, TestCloudJSONFacetJoinDomain, RequestHandlersTest, TestRangeQuery, TestJsonFacetRefinement, ZkCLITest, ExternalFileFieldSortTest, LukeRequestHandlerTest, SimpleMLTQParserTest, AutoScalingHandlerTest, CdcrBootstrapTest, TestBulkSchemaConcurrent, CoreAdminHandlerTest, SuggestComponentTest, TestRuleBasedAuthorizationPlugin, CdcrUpdateLogTest, SpellCheckCollatorWithCollapseTest, SortByFunctionTest, MissingSegmentRecoveryTest]

            People

            • Assignee: markrmiller@gmail.com Mark Miller
            • Reporter: mdrob Mike Drob
            • Votes: 0
            • Watchers: 4

            Dates

            • Created:
            • Updated:

            Development