Solr
  1. Solr
  2. SOLR-2861

transaction log / realtime-get failures

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None

      Description

      realtime-get or the transaction log fails sometimes

      1. fail11.xml
        4.44 MB
        Yonik Seeley
      2. fail13_postfix.xml
        1.16 MB
        Yonik Seeley
      3. fail16.xml
        667 kB
        Yonik Seeley
      4. fail17.xml
        1.92 MB
        Yonik Seeley
      5. fail4.xml
        371 kB
        Yonik Seeley
      6. fail5.xml
        244 kB
        Yonik Seeley
      7. jenkins_output.txt
        18 kB
        Yonik Seeley
      8. solr_debug.patch
        30 kB
        Yonik Seeley
      9. TEST-org.apache.solr.search.TestRealTimeGet.xml
        930 kB
        Yonik Seeley
      10. trunk-0.txt
        2 kB
        selckin

        Issue Links

          Activity

          Hide
          Yonik Seeley added a comment -

          Here's the output from one of the failed jenkins runs. I have not been able to reproduce these EOF exceptions on my boxes.

          Show
          Yonik Seeley added a comment - Here's the output from one of the failed jenkins runs. I have not been able to reproduce these EOF exceptions on my boxes.
          Hide
          Yonik Seeley added a comment -

          One failure I was able to reproduce after running for an hour was this:

          SEVERE: java.lang.RuntimeException: java.lang.NullPointerException
                  at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1140)
                  at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.j
          ava:332)
                  at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpda
          teProcessorFactory.java:84)
                  at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:158)
                  at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:78)
                  at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(Conten
          tStreamHandlerBase.java:58)
                  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBa
          se.java:129)
                  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1453)
                  at org.apache.solr.servlet.DirectSolrConnection.request(DirectSolrConnection
          .java:131)
                  at org.apache.solr.util.TestHarness.update(TestHarness.java:247)
                  at org.apache.solr.util.TestHarness.checkUpdateStatus(TestHarness.java:295)
                  at org.apache.solr.util.TestHarness.validateUpdate(TestHarness.java:265)
                  at org.apache.solr.SolrTestCaseJ4.checkUpdateU(SolrTestCaseJ4.java:357)
                  at org.apache.solr.SolrTestCaseJ4.assertU(SolrTestCaseJ4.java:336)
                  at org.apache.solr.SolrTestCaseJ4.assertU(SolrTestCaseJ4.java:330)
                  at org.apache.solr.search.TestRealTimeGet$1.run(TestRealTimeGet.java:212)
          Caused by: java.lang.NullPointerException
                  at org.apache.solr.core.SolrCore.getIndexDir(SolrCore.java:156)
                  at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1104)
                  ... 15 more
          

          But this doesn't look related.

          Show
          Yonik Seeley added a comment - One failure I was able to reproduce after running for an hour was this: SEVERE: java.lang.RuntimeException: java.lang.NullPointerException at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1140) at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.j ava:332) at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpda teProcessorFactory.java:84) at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:158) at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:78) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(Conten tStreamHandlerBase.java:58) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBa se.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1453) at org.apache.solr.servlet.DirectSolrConnection.request(DirectSolrConnection .java:131) at org.apache.solr.util.TestHarness.update(TestHarness.java:247) at org.apache.solr.util.TestHarness.checkUpdateStatus(TestHarness.java:295) at org.apache.solr.util.TestHarness.validateUpdate(TestHarness.java:265) at org.apache.solr.SolrTestCaseJ4.checkUpdateU(SolrTestCaseJ4.java:357) at org.apache.solr.SolrTestCaseJ4.assertU(SolrTestCaseJ4.java:336) at org.apache.solr.SolrTestCaseJ4.assertU(SolrTestCaseJ4.java:330) at org.apache.solr.search.TestRealTimeGet$1.run(TestRealTimeGet.java:212) Caused by: java.lang.NullPointerException at org.apache.solr.core.SolrCore.getIndexDir(SolrCore.java:156) at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1104) ... 15 more But this doesn't look related.
          Hide
          Yonik Seeley added a comment -

          The reason for the bug above looks relatively clear - the _searcher variable is being accessed without grabbing searcherLock, and that's a no-no. SolrCore.getIndexDir() is buggy. I'll fix.

          Show
          Yonik Seeley added a comment - The reason for the bug above looks relatively clear - the _searcher variable is being accessed without grabbing searcherLock, and that's a no-no. SolrCore.getIndexDir() is buggy. I'll fix.
          Hide
          Yonik Seeley added a comment -

          OK, I finally got an assertion failure after running all night. Log attached.

          I've still never seen a single EOF exception like jenkins sees...

          Show
          Yonik Seeley added a comment - OK, I finally got an assertion failure after running all night. Log attached. I've still never seen a single EOF exception like jenkins sees...
          Hide
          Yonik Seeley added a comment -

          Thanks selckin! Looks like you hit what I did (a test assert, but not an EOF exception).
          How easy is it for you to reproduce?

          Show
          Yonik Seeley added a comment - Thanks selckin! Looks like you hit what I did (a test assert, but not an EOF exception). How easy is it for you to reproduce?
          Hide
          selckin added a comment -

          hit that one in about 30min then swapped to running all tests

          Show
          selckin added a comment - hit that one in about 30min then swapped to running all tests
          Hide
          Yonik Seeley added a comment -

          Here are some selected lines from fail4.xml, with comments:

          316   TEST:WRITER7: adding id 30 val= 3 DONE
          466   TEST:WRITER9: softCommit start
          # Note: this is only the start of softCommit in the test... it's synchronized in Solr so only one thread can reopen the
          # reader and create a new searcher at a time.
          3074  TEST:WRITER10: deleting id 30 val= 6 DONE
          # At this point the delete command has returned (and in the test, we put -6 in the model as the value)
          3592  VERBOSE:WRITER9: TLOG: preSoftCommit: prevMap=1355224018 new map=10577597
          # Note: actual reopen happens here, in a synchronized context.
          3593  VERBOSE:WRITER9: added searcher  Searcher@6460029d main  to _searchers reader version= 1319870174085
          3673  VERBOSE:READER7: TLOG: lookup: for id 30 in prevMap 10577597 got null lookupLog=/opt/code/lusolr/solr/build/solr-core/test/1/solrtest-TestRealTimeGet-1319870172806/tlog/tlog.0000000000000000002
          3673  VERBOSE:READER7: RealTimeGet using searcher  Searcher@6460029d main
          3677  TEST:READER7: ERROR, id= 30 foundVal= 3 model val= -6 realTime= true
          # We find the old value using the new searcher!
          

          From this, it doesn't look like it's the transaction log, or realtime-get.
          We can see that the doc was deleted, a new searcher was opened, and then it was used to do a lookup and got the old value.

          Brainstorming:

          • Bug in solr reopen code. This seems less likely since getSearcher() calls that result in a reopen are synchronized from the outside... so races shouldn't be possible there.
          • Bug in lucene. This is looking more likely now, but I can't yet reproduce it in a pure lucene test. It could be a reopen concurrency bug (like the last were) or possibly a MultiFields & friends bug (see SolrIndexSearcher/getFirstMatch used to do the lookup by id)
          Show
          Yonik Seeley added a comment - Here are some selected lines from fail4.xml, with comments: 316 TEST:WRITER7: adding id 30 val= 3 DONE 466 TEST:WRITER9: softCommit start # Note: this is only the start of softCommit in the test... it's synchronized in Solr so only one thread can reopen the # reader and create a new searcher at a time. 3074 TEST:WRITER10: deleting id 30 val= 6 DONE # At this point the delete command has returned (and in the test, we put -6 in the model as the value) 3592 VERBOSE:WRITER9: TLOG: preSoftCommit: prevMap=1355224018 new map=10577597 # Note: actual reopen happens here, in a synchronized context. 3593 VERBOSE:WRITER9: added searcher Searcher@6460029d main to _searchers reader version= 1319870174085 3673 VERBOSE:READER7: TLOG: lookup: for id 30 in prevMap 10577597 got null lookupLog=/opt/code/lusolr/solr/build/solr-core/test/1/solrtest-TestRealTimeGet-1319870172806/tlog/tlog.0000000000000000002 3673 VERBOSE:READER7: RealTimeGet using searcher Searcher@6460029d main 3677 TEST:READER7: ERROR, id= 30 foundVal= 3 model val= -6 realTime= true # We find the old value using the new searcher! From this, it doesn't look like it's the transaction log, or realtime-get. We can see that the doc was deleted, a new searcher was opened, and then it was used to do a lookup and got the old value. Brainstorming: Bug in solr reopen code. This seems less likely since getSearcher() calls that result in a reopen are synchronized from the outside... so races shouldn't be possible there. Bug in lucene. This is looking more likely now, but I can't yet reproduce it in a pure lucene test. It could be a reopen concurrency bug (like the last were) or possibly a MultiFields & friends bug (see SolrIndexSearcher/getFirstMatch used to do the lookup by id)
          Hide
          Yonik Seeley added a comment -

          Here's the fail5.xml analysis (more debugging prints this run):

          2340  TEST:WRITER5: adding id 0 val= 7 DONE
          2347  VERBOSE:WRITER3: commit reopen start: reopenWithWriter= true searcher= Searcher@1e9cd8db main{DirectoryReader(segments_4:1319924531747:nrt _3(4.0):C7/4 _2(4.0):C4/3),version=1319924531747} reader= DirectoryReader(segments_4:1319924531747:nrt _3(4.0):C7/4 _2(4.0):C4/3)
          
          2438 TEST:WRITER6: deleting id 0 val= 8
          2480 TEST:WRITER6: deleting id 0 val= 8 DONE
          
          2491  VERBOSE:WRITER3: reopen DONE: searcher= Searcher@2aa937cd main{DirectoryReader(segments_4:1319924531751:nrt _4(4.0):C13/8),version=1319924531751} reader= DirectoryReader(segments_4:1319924531751:nrt _4(4.0):C13/8)
          
          ### Note that the delete happened sometime during (or right before or after) the reopen, so we don't know if version 1319924531751 reflects the delete.  That's OK, because we are going to do another reopen next:
          
          2516  VERBOSE:WRITER6: commit reopen start: reopenWithWriter= true searcher= Searcher@2aa937cd main{DirectoryReader(segments_4:1319924531751:nrt _4(4.0):C13/8),version=1319924531751} reader= DirectoryReader(segments_4:1319924531751:nrt _4(4.0):C13/8)
          2518  VERBOSE:WRITER6: reopen DONE: searcher= Searcher@59c958af main{DirectoryReader(segments_4:1319924531751:nrt _4(4.0):C13/8),version=1319924531751} reader= DirectoryReader(segments_4:1319924531751:nrt _4(4.0):C13/8)
          
          ### OK... so the reopen didn't change the reader.  That's OK, since there weren't necessarily any new
              changes (i.e. the delete of 0 could be reflected in the previous reopen).
          
          2529  VERBOSE:READER18: RealTimeGet using searcher  Searcher@59c958af main{DirectoryReader(segments_4:1319924531751:nrt _4(4.0):C13/8),version=1319924531751}
          2531  TEST:READER18: ERROR, id= 0 foundVal= 7 model val= -8 realTime= true
          
          #### Oops, we've found the old value again somehow.
          

          Starting to look more like a lucene bug - but still can't reproduce at that level.

          Show
          Yonik Seeley added a comment - Here's the fail5.xml analysis (more debugging prints this run): 2340 TEST:WRITER5: adding id 0 val= 7 DONE 2347 VERBOSE:WRITER3: commit reopen start: reopenWithWriter= true searcher= Searcher@1e9cd8db main{DirectoryReader(segments_4:1319924531747:nrt _3(4.0):C7/4 _2(4.0):C4/3),version=1319924531747} reader= DirectoryReader(segments_4:1319924531747:nrt _3(4.0):C7/4 _2(4.0):C4/3) 2438 TEST:WRITER6: deleting id 0 val= 8 2480 TEST:WRITER6: deleting id 0 val= 8 DONE 2491 VERBOSE:WRITER3: reopen DONE: searcher= Searcher@2aa937cd main{DirectoryReader(segments_4:1319924531751:nrt _4(4.0):C13/8),version=1319924531751} reader= DirectoryReader(segments_4:1319924531751:nrt _4(4.0):C13/8) ### Note that the delete happened sometime during (or right before or after) the reopen, so we don't know if version 1319924531751 reflects the delete. That's OK, because we are going to do another reopen next: 2516 VERBOSE:WRITER6: commit reopen start: reopenWithWriter= true searcher= Searcher@2aa937cd main{DirectoryReader(segments_4:1319924531751:nrt _4(4.0):C13/8),version=1319924531751} reader= DirectoryReader(segments_4:1319924531751:nrt _4(4.0):C13/8) 2518 VERBOSE:WRITER6: reopen DONE: searcher= Searcher@59c958af main{DirectoryReader(segments_4:1319924531751:nrt _4(4.0):C13/8),version=1319924531751} reader= DirectoryReader(segments_4:1319924531751:nrt _4(4.0):C13/8) ### OK... so the reopen didn't change the reader. That's OK, since there weren't necessarily any new changes (i.e. the delete of 0 could be reflected in the previous reopen). 2529 VERBOSE:READER18: RealTimeGet using searcher Searcher@59c958af main{DirectoryReader(segments_4:1319924531751:nrt _4(4.0):C13/8),version=1319924531751} 2531 TEST:READER18: ERROR, id= 0 foundVal= 7 model val= -8 realTime= true #### Oops, we've found the old value again somehow. Starting to look more like a lucene bug - but still can't reproduce at that level.
          Hide
          Yonik Seeley added a comment -

          Update: I think I may have found one logic bug in solr - a hard commit reopen can go back in time since it doesn't use the writer to repoen. For example Thread1: commit() Thread2: reopen(writer) Thread1: reopen(without_writer). That last reopen will reopen at the point of the commit, going back in time if updates happened in between.

          Now the bad news: I changed the test to only use soft commits, and still hit problems. I also changed some of the prints to immediately proceed and follow the lucene API calls to try and further nail this down. The log file was too large to attach, but here is the analysis:

          13628  TEST:WRITER21: adding id 4 val= 21
          13779  VERBOSE:WRITER21: updateDocument id:4 Document<stored,indexed,indexOptions=DOCS_ONLY<id:4> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<val_l:21>> org.apache.solr.update.SolrIndexWriter@4ba76eff
          13969  VERBOSE:WRITER21: updateDocument id:4 DONE
          
          69276  VERBOSE:WRITER3: start reopen from Searcher@17db8f8e main{DirectoryReader(segments_4:1320005350742:nrt _4h(4.0):C18/17 _4i(4.0):C3/2 _4j(4.0):C1 _4k(4.0):C3/1 _4l(4.0):C6)} writer= org.apache.solr.update.SolrIndexWriter@4ba76eff
          69293  VERBOSE:WRITER3: reopen result DirectoryReader(segments_4:1320005350743:nrt _4h(4.0):C18/17 _4i(4.0):C3/2 _4j(4.0):C1 _4k(4.0):C3/1 _4l(4.0):C6/1)
          
          69349  TEST:WRITER2: deleting id 4 val= 22
          69365  VERBOSE:WRITER2: deleteDocuments id:4 org.apache.solr.update.SolrIndexWriter@4ba76eff
          69366  VERBOSE:WRITER9: start reopen from Searcher@256d6cf main{DirectoryReader(segments_4:1320005350743:nrt _4h(4.0):C18/17 _4i(4.0):C3/2 _4j(4.0):C1 _4k(4.0):C3/1 _4l(4.0):C6/1)} writer= org.apache.solr.update.SolrIndexWriter@4ba76eff
          69367 VERBOSE:WRITER9: reopen result (null)
          
          69394  VERBOSE:WRITER2: deleteDocuments id:4 DONE
          
          69424  VERBOSE:WRITER7: start reopen from Searcher@61c3e3fb main{DirectoryReader(segments_4:1320005350743:nrt _4h(4.0):C18/17 _4i(4.0):C3/2 _4j(4.0):C1 _4k(4.0):C3/1 _4l(4.0):C6/1)} writer= org.apache.solr.update.SolrIndexWriter@4ba76eff
          69426  VERBOSE:WRITER7: reopen result (null)
          T
          
          69447  VERBOSE:READER14: RealTimeGet using searcher  Searcher@742136c6 main{DirectoryReader(segments_4:1320005350743:nrt _4h(4.0):C18/17 _4i(4.0):C3/2 _4j(4.0):C1 _4k(4.0):C3/1 _4l(4.0):C6/1)}
          TEST:READER14: ERROR, id= 4 foundVal= 21 model val= -20 realTime= true
          

          So as you can see, this looks much like anaylsis5 - a delete happened during the first reopen (hence may or may not be reflected in that reopen), and then there was another reopen which resulted in "null" (no changes). But a lookup using the resulting reader finds the document which should be deleted.

          This is still looking like we have a lucene bug.

          Show
          Yonik Seeley added a comment - Update: I think I may have found one logic bug in solr - a hard commit reopen can go back in time since it doesn't use the writer to repoen. For example Thread1: commit() Thread2: reopen(writer) Thread1: reopen(without_writer). That last reopen will reopen at the point of the commit, going back in time if updates happened in between. Now the bad news: I changed the test to only use soft commits, and still hit problems. I also changed some of the prints to immediately proceed and follow the lucene API calls to try and further nail this down. The log file was too large to attach, but here is the analysis: 13628 TEST:WRITER21: adding id 4 val= 21 13779 VERBOSE:WRITER21: updateDocument id:4 Document<stored,indexed,indexOptions=DOCS_ONLY<id:4> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<val_l:21>> org.apache.solr.update.SolrIndexWriter@4ba76eff 13969 VERBOSE:WRITER21: updateDocument id:4 DONE 69276 VERBOSE:WRITER3: start reopen from Searcher@17db8f8e main{DirectoryReader(segments_4:1320005350742:nrt _4h(4.0):C18/17 _4i(4.0):C3/2 _4j(4.0):C1 _4k(4.0):C3/1 _4l(4.0):C6)} writer= org.apache.solr.update.SolrIndexWriter@4ba76eff 69293 VERBOSE:WRITER3: reopen result DirectoryReader(segments_4:1320005350743:nrt _4h(4.0):C18/17 _4i(4.0):C3/2 _4j(4.0):C1 _4k(4.0):C3/1 _4l(4.0):C6/1) 69349 TEST:WRITER2: deleting id 4 val= 22 69365 VERBOSE:WRITER2: deleteDocuments id:4 org.apache.solr.update.SolrIndexWriter@4ba76eff 69366 VERBOSE:WRITER9: start reopen from Searcher@256d6cf main{DirectoryReader(segments_4:1320005350743:nrt _4h(4.0):C18/17 _4i(4.0):C3/2 _4j(4.0):C1 _4k(4.0):C3/1 _4l(4.0):C6/1)} writer= org.apache.solr.update.SolrIndexWriter@4ba76eff 69367 VERBOSE:WRITER9: reopen result ( null ) 69394 VERBOSE:WRITER2: deleteDocuments id:4 DONE 69424 VERBOSE:WRITER7: start reopen from Searcher@61c3e3fb main{DirectoryReader(segments_4:1320005350743:nrt _4h(4.0):C18/17 _4i(4.0):C3/2 _4j(4.0):C1 _4k(4.0):C3/1 _4l(4.0):C6/1)} writer= org.apache.solr.update.SolrIndexWriter@4ba76eff 69426 VERBOSE:WRITER7: reopen result ( null ) T 69447 VERBOSE:READER14: RealTimeGet using searcher Searcher@742136c6 main{DirectoryReader(segments_4:1320005350743:nrt _4h(4.0):C18/17 _4i(4.0):C3/2 _4j(4.0):C1 _4k(4.0):C3/1 _4l(4.0):C6/1)} TEST:READER14: ERROR, id= 4 foundVal= 21 model val= -20 realTime= true So as you can see, this looks much like anaylsis5 - a delete happened during the first reopen (hence may or may not be reflected in that reopen), and then there was another reopen which resulted in "null" (no changes). But a lookup using the resulting reader finds the document which should be deleted. This is still looking like we have a lucene bug.
          Hide
          Yonik Seeley added a comment -

          Here's a patch with the debugging code I've been using. Notice the prints surrounding the lucene API calls for adding, deleting, and reopening.

          Show
          Yonik Seeley added a comment - Here's a patch with the debugging code I've been using. Notice the prints surrounding the lucene API calls for adding, deleting, and reopening.
          Hide
          Simon Willnauer added a comment -

          tricky shit man! I think I found the issue! we only check the bufferedDeleteStream for changes in IW#nrtIsCurrent instead we should check the deleteQueue since there could be some deletes still in the queue but not yet applied to the delete stream. here is a diff though:

          Index: java/org/apache/lucene/index/IndexWriter.java
          ===================================================================
          --- java/org/apache/lucene/index/IndexWriter.java	(revision 1195214)
          +++ java/org/apache/lucene/index/IndexWriter.java	(working copy)
          @@ -4074,7 +4074,7 @@
             synchronized boolean nrtIsCurrent(SegmentInfos infos) {
               //System.out.println("IW.nrtIsCurrent " + (infos.version == segmentInfos.version && !docWriter.anyChanges() && !bufferedDeletesStream.any()));
               ensureOpen();
          -    return infos.version == segmentInfos.version && !docWriter.anyChanges() && !bufferedDeletesStream.any();
          +    return infos.version == segmentInfos.version && !docWriter.anyChanges() && !docWriter.deleteQueue.anyChanges();
             }
          

          yonik, I didn't try this since I only looked at the code for a little while to figure out what could be the reason you see this. Can you verify?

          Show
          Simon Willnauer added a comment - tricky shit man! I think I found the issue! we only check the bufferedDeleteStream for changes in IW#nrtIsCurrent instead we should check the deleteQueue since there could be some deletes still in the queue but not yet applied to the delete stream. here is a diff though: Index: java/org/apache/lucene/index/IndexWriter.java =================================================================== --- java/org/apache/lucene/index/IndexWriter.java (revision 1195214) +++ java/org/apache/lucene/index/IndexWriter.java (working copy) @@ -4074,7 +4074,7 @@ synchronized boolean nrtIsCurrent(SegmentInfos infos) { // System .out.println( "IW.nrtIsCurrent " + (infos.version == segmentInfos.version && !docWriter.anyChanges() && !bufferedDeletesStream.any())); ensureOpen(); - return infos.version == segmentInfos.version && !docWriter.anyChanges() && !bufferedDeletesStream.any(); + return infos.version == segmentInfos.version && !docWriter.anyChanges() && !docWriter.deleteQueue.anyChanges(); } yonik, I didn't try this since I only looked at the code for a little while to figure out what could be the reason you see this. Can you verify?
          Hide
          Simon Willnauer added a comment -

          I created LUCENE-3546 to track this problem further

          Show
          Simon Willnauer added a comment - I created LUCENE-3546 to track this problem further
          Hide
          Yonik Seeley added a comment -

          Thanks Simon - I'll see if I can reproduce a failure with infostream on.

          Show
          Yonik Seeley added a comment - Thanks Simon - I'll see if I can reproduce a failure with infostream on.
          Hide
          Yonik Seeley added a comment -

          I haven't yet been able to reproduce with infostream on, but here's another failure w/ it off (again raw log file too big to attach)

          187551  VERBOSE:WRITER27: deleteDocuments id:25 org.apache.solr.update.SolrIndexWriter@10c789fb
          187552  VERBOSE:WRITER0: start reopen from Searcher@5df9cdda main{DirectoryReader(segments_t:1320072235329:nrt _hz(4.0):C18/11 _hy(4.0):c1 _hw(4.0):c1 _i1(4.0):C12/6 _i3(4.0):c1 _i4(4.0):C3 _i5(4.0):c1)} writer= org.apache.solr.update.SolrIndexWriter@10c789fb
          187553  VERBOSE:WRITER0: reopen result (null)
          187554  VERBOSE:WRITER27: deleteDocuments id:25 DONE
          
          187559  VERBOSE:WRITER4: start reopen from Searcher@5166b0df main{DirectoryReader(segments_t:1320072235329:nrt _hz(4.0):C18/11 _hy(4.0):c1 _hw(4.0):c1 _i1(4.0):C12/6 _i3(4.0):c1 _i4(4.0):C3 _i5(4.0):c1)} writer= org.apache.solr.update.SolrIndexWriter@10c789fb
          187560  VERBOSE:WRITER4: reopen result (null)
          
          187891  VERBOSE:READER8: RealTimeGet using searcher  Searcher@642424ad main{DirectoryReader(segments_t:1320072235329:nrt _hz(4.0):C18/11 _hy(4.0):c1 _hw(4.0):c1 _i1(4.0):C12/6 _i3(4.0):c1 _i4(4.0):C3 _i5(4.0):c1)}
          187892  TEST:READER8: ERROR, id= 25 foundVal= 9 model val= -10 realTime= true
          

          Definitely a pattern developing here (same as the previous two)... a delete happens concurrently with a reopen (of course from prints alone, we can't tell if it was just before, during, or just after). The reopen following that reopen should reflect the delete at least... but it doesn't.

          Show
          Yonik Seeley added a comment - I haven't yet been able to reproduce with infostream on, but here's another failure w/ it off (again raw log file too big to attach) 187551 VERBOSE:WRITER27: deleteDocuments id:25 org.apache.solr.update.SolrIndexWriter@10c789fb 187552 VERBOSE:WRITER0: start reopen from Searcher@5df9cdda main{DirectoryReader(segments_t:1320072235329:nrt _hz(4.0):C18/11 _hy(4.0):c1 _hw(4.0):c1 _i1(4.0):C12/6 _i3(4.0):c1 _i4(4.0):C3 _i5(4.0):c1)} writer= org.apache.solr.update.SolrIndexWriter@10c789fb 187553 VERBOSE:WRITER0: reopen result ( null ) 187554 VERBOSE:WRITER27: deleteDocuments id:25 DONE 187559 VERBOSE:WRITER4: start reopen from Searcher@5166b0df main{DirectoryReader(segments_t:1320072235329:nrt _hz(4.0):C18/11 _hy(4.0):c1 _hw(4.0):c1 _i1(4.0):C12/6 _i3(4.0):c1 _i4(4.0):C3 _i5(4.0):c1)} writer= org.apache.solr.update.SolrIndexWriter@10c789fb 187560 VERBOSE:WRITER4: reopen result ( null ) 187891 VERBOSE:READER8: RealTimeGet using searcher Searcher@642424ad main{DirectoryReader(segments_t:1320072235329:nrt _hz(4.0):C18/11 _hy(4.0):c1 _hw(4.0):c1 _i1(4.0):C12/6 _i3(4.0):c1 _i4(4.0):C3 _i5(4.0):c1)} 187892 TEST:READER8: ERROR, id= 25 foundVal= 9 model val= -10 realTime= true Definitely a pattern developing here (same as the previous two)... a delete happens concurrently with a reopen (of course from prints alone, we can't tell if it was just before, during, or just after). The reopen following that reopen should reflect the delete at least... but it doesn't.
          Hide
          Yonik Seeley added a comment -

          Update: I think I may have found one logic bug in solr - a hard commit reopen can go back in time since it doesn't use the writer to repoen. For example Thread1: commit() Thread2: reopen(writer) Thread1: reopen(without_writer). That last reopen will reopen at the point of the commit, going back in time if updates happened in between.

          Actually, I don't think this is true. AFAIK, once you call openIfChanged(reader,writer) it doesn't matter if future calls are just to openIfChanged(reader) since reopening an NRT reader will always result in another NRT reader?

          Show
          Yonik Seeley added a comment - Update: I think I may have found one logic bug in solr - a hard commit reopen can go back in time since it doesn't use the writer to repoen. For example Thread1: commit() Thread2: reopen(writer) Thread1: reopen(without_writer). That last reopen will reopen at the point of the commit, going back in time if updates happened in between. Actually, I don't think this is true. AFAIK, once you call openIfChanged(reader,writer) it doesn't matter if future calls are just to openIfChanged(reader) since reopening an NRT reader will always result in another NRT reader?
          Hide
          Yonik Seeley added a comment - - edited

          OK, it finally failed with the infostream set (see fail11.xml)! Unfortunately, I'm not sure if it will provide much (or any) info.

          
          46563  VERBOSE:WRITER7: start reopen from Searcher@5f159e0c main{DirectoryReader(segments_2:1320108281745:nrt _5p(4.0):C20/16 _5n(4.0):C7/5 _5q(4.0):C7/2 _5r(4.0):C7 _5s(4.0):C2)} writer= org.apache.solr.update.SolrIndexWriter@1de4eb5b
          46686  VERBOSE:WRITER7: reopen result DirectoryReader(segments_2:1320108281747:nrt _5p(4.0):C20/16 _5n(4.0):C7/5 _5q(4.0):C7/3 _5r(4.0):C7/1 _5s(4.0):C2 _5t(4.0):C1)
          
          46706  VERBOSE:WRITER11: deleteDocuments id:12 org.apache.solr.update.SolrIndexWriter@1de4eb5b
          46707  VERBOSE:WRITER0: start reopen from Searcher@502c06b2 main{DirectoryReader(segments_2:1320108281747:nrt _5p(4.0):C20/16 _5n(4.0):C7/5 _5q(4.0):C7/3 _5r(4.0):C7/1 _5s(4.0):C2 _5t(4.0):C1)} writer= org.apache.solr.update.SolrIndexWriter@1de4eb5b
          46708  VERBOSE:WRITER0: reopen result (null)
          46709  VERBOSE:WRITER11: deleteDocuments id:12 DONE
          
          46714  VERBOSE:WRITER12: start reopen from Searcher@7a6bb93c main{DirectoryReader(segments_2:1320108281747:nrt _5p(4.0):C20/16 _5n(4.0):C7/5 _5q(4.0):C7/3 _5r(4.0):C7/1 _5s(4.0):C2 _5t(4.0):C1)} writer= org.apache.solr.update.SolrIndexWriter@1de4eb5b
          46715  VERBOSE:WRITER12: reopen result (null)
          
          46924  VERBOSE:READER1: RealTimeGet using searcher  Searcher@69ed2bbd main{DirectoryReader(segments_3:1320108281747:nrt _5p(4.0):C20/16 _5n(4.0):C7/5 _5q(4.0):C7/3 _5r(4.0):C7/1 _5s(4.0):C2 _5t(4.0):C1)}
          46926  VERBOSE:READER1: ERROR, id= 12 foundVal= 30 model val= -31 realTime= true
          
          

          It's the same story: the delete happened concurrently (or slightly before or after) the reopen. Both that reopen, and the following reopen returned null, which is definitely a bug.

          Show
          Yonik Seeley added a comment - - edited OK, it finally failed with the infostream set (see fail11.xml)! Unfortunately, I'm not sure if it will provide much (or any) info. 46563 VERBOSE:WRITER7: start reopen from Searcher@5f159e0c main{DirectoryReader(segments_2:1320108281745:nrt _5p(4.0):C20/16 _5n(4.0):C7/5 _5q(4.0):C7/2 _5r(4.0):C7 _5s(4.0):C2)} writer= org.apache.solr.update.SolrIndexWriter@1de4eb5b 46686 VERBOSE:WRITER7: reopen result DirectoryReader(segments_2:1320108281747:nrt _5p(4.0):C20/16 _5n(4.0):C7/5 _5q(4.0):C7/3 _5r(4.0):C7/1 _5s(4.0):C2 _5t(4.0):C1) 46706 VERBOSE:WRITER11: deleteDocuments id:12 org.apache.solr.update.SolrIndexWriter@1de4eb5b 46707 VERBOSE:WRITER0: start reopen from Searcher@502c06b2 main{DirectoryReader(segments_2:1320108281747:nrt _5p(4.0):C20/16 _5n(4.0):C7/5 _5q(4.0):C7/3 _5r(4.0):C7/1 _5s(4.0):C2 _5t(4.0):C1)} writer= org.apache.solr.update.SolrIndexWriter@1de4eb5b 46708 VERBOSE:WRITER0: reopen result ( null ) 46709 VERBOSE:WRITER11: deleteDocuments id:12 DONE 46714 VERBOSE:WRITER12: start reopen from Searcher@7a6bb93c main{DirectoryReader(segments_2:1320108281747:nrt _5p(4.0):C20/16 _5n(4.0):C7/5 _5q(4.0):C7/3 _5r(4.0):C7/1 _5s(4.0):C2 _5t(4.0):C1)} writer= org.apache.solr.update.SolrIndexWriter@1de4eb5b 46715 VERBOSE:WRITER12: reopen result ( null ) 46924 VERBOSE:READER1: RealTimeGet using searcher Searcher@69ed2bbd main{DirectoryReader(segments_3:1320108281747:nrt _5p(4.0):C20/16 _5n(4.0):C7/5 _5q(4.0):C7/3 _5r(4.0):C7/1 _5s(4.0):C2 _5t(4.0):C1)} 46926 VERBOSE:READER1: ERROR, id= 12 foundVal= 30 model val= -31 realTime= true It's the same story: the delete happened concurrently (or slightly before or after) the reopen. Both that reopen, and the following reopen returned null, which is definitely a bug.
          Hide
          Simon Willnauer added a comment -

          I reopened LUCENE-3546 since the bug actually exists. see my comments on LUCENE-3546

          Show
          Simon Willnauer added a comment - I reopened LUCENE-3546 since the bug actually exists. see my comments on LUCENE-3546
          Hide
          Michael McCandless added a comment -

          once you call openIfChanged(reader,writer) it doesn't matter if future calls are just to openIfChanged(reader) since reopening an NRT reader will always result in another NRT reader?

          That's correct.

          Show
          Michael McCandless added a comment - once you call openIfChanged(reader,writer) it doesn't matter if future calls are just to openIfChanged(reader) since reopening an NRT reader will always result in another NRT reader? That's correct.
          Hide
          Yonik Seeley added a comment -

          Here's a run (fail13) w/ mixed hard+soft commits after Simon's patch.

          14047  VERBOSE:WRITER5: start reopen without writer, reader= DirectoryReader(segments_6:1320156373422:nrt _y(4.0):C78/7 _n(4.0):c2/1 _x(4.0):c11/1 _w(4.0):c1)
          14184  VERBOSE:WRITER14: updateDocument id:98 Document<stored,indexed,indexOptions=DOCS_ONLY<id:98> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<val_l:3>> org.apache.solr.update.SolrIndexWriter@43a4181c
          14504  VERBOSE:WRITER14: updateDocument id:98 DONE
          14523  VERBOSE:WRITER5: reopen result DirectoryReader(segments_6:1320156373425:nrt _y(4.0):C78/15 _n(4.0):c2/1 _x(4.0):c11/1 _w(4.0):c1 _z(4.0):C7)
          
          ### The update to id:98 may or may not be in the resulting reader above
          
          14728  VERBOSE:WRITER8: start reopen from Searcher@28294f62 main{DirectoryReader(segments_6:1320156373425:nrt _y(4.0):C78/15 _n(4.0):c2/1 _x(4.0):c11/1 _w(4.0):c1 _z(4.0):C7)} writer= org.apache.solr.update.SolrIndexWriter@43a4181c
          14729  VERBOSE:WRITER8: reopen result (null)
          
          ### But this reopen shows no changes, so the update id:98 must be in reader 1320156373425
          
          14920  VERBOSE:READER2: RealTimeGet using searcher  Searcher@79b0d33c main{DirectoryReader(segments_6:1320156373425:nrt _y(4.0):C78/15 _n(4.0):c2/1 _x(4.0):c11/1 _w(4.0):c1 _z(4.0):C7)}
          14922  VERBOSE:READER2: ERROR, id= 98 foundVal= 2 model val= 3 realTime= true
          
          ### Since we got the old value, we either missed the whole update, or just the delete part of the update
          (since the uniqueKey lookup stops at the first match, the new document may be in the index, but we hit the old document first that wasn't marked as deleted)
          
          Show
          Yonik Seeley added a comment - Here's a run (fail13) w/ mixed hard+soft commits after Simon's patch. 14047 VERBOSE:WRITER5: start reopen without writer, reader= DirectoryReader(segments_6:1320156373422:nrt _y(4.0):C78/7 _n(4.0):c2/1 _x(4.0):c11/1 _w(4.0):c1) 14184 VERBOSE:WRITER14: updateDocument id:98 Document<stored,indexed,indexOptions=DOCS_ONLY<id:98> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<val_l:3>> org.apache.solr.update.SolrIndexWriter@43a4181c 14504 VERBOSE:WRITER14: updateDocument id:98 DONE 14523 VERBOSE:WRITER5: reopen result DirectoryReader(segments_6:1320156373425:nrt _y(4.0):C78/15 _n(4.0):c2/1 _x(4.0):c11/1 _w(4.0):c1 _z(4.0):C7) ### The update to id:98 may or may not be in the resulting reader above 14728 VERBOSE:WRITER8: start reopen from Searcher@28294f62 main{DirectoryReader(segments_6:1320156373425:nrt _y(4.0):C78/15 _n(4.0):c2/1 _x(4.0):c11/1 _w(4.0):c1 _z(4.0):C7)} writer= org.apache.solr.update.SolrIndexWriter@43a4181c 14729 VERBOSE:WRITER8: reopen result ( null ) ### But this reopen shows no changes, so the update id:98 must be in reader 1320156373425 14920 VERBOSE:READER2: RealTimeGet using searcher Searcher@79b0d33c main{DirectoryReader(segments_6:1320156373425:nrt _y(4.0):C78/15 _n(4.0):c2/1 _x(4.0):c11/1 _w(4.0):c1 _z(4.0):C7)} 14922 VERBOSE:READER2: ERROR, id= 98 foundVal= 2 model val= 3 realTime= true ### Since we got the old value, we either missed the whole update, or just the delete part of the update (since the uniqueKey lookup stops at the first match, the new document may be in the index, but we hit the old document first that wasn't marked as deleted)
          Hide
          Simon Willnauer added a comment -
          #those two happen at the same time roughly
          14728 VERBOSE:WRITER8: start reopen from Searcher@28294f62 main{DirectoryReader(segments_6:1320156373425:nrt _y(4.0):C78/15 _n(4.0):c2/1 _x(4.0):c11/1 _w(4.0):c1 _z(4.0):C7)} writer= org.apache.solr.update.SolrIndexWriter@43a4181c
          14729 VERBOSE:WRITER8: reopen result (null)
          
          14733 VERBOSE:WRITER7: start reopen from Searcher@79b0d33c main{DirectoryReader(segments_6:1320156373425:nrt _y(4.0):C78/15 _n(4.0):c2/1 _x(4.0):c11/1 _w(4.0):c1 _z(4.0):C7)} writer= org.apache.solr.update.SolrIndexWriter@43a4181c
          
          14920 VERBOSE:READER2: RealTimeGet using searcher  Searcher@79b0d33c main{DirectoryReader(segments_6:1320156373425:nrt _y(4.0):C78/15 _n(4.0):c2/1 _x(4.0):c11/1 _w(4.0):c1 _z(4.0):C7)}
          14922 VERBOSE:READER2: ERROR, id= 98 foundVal= 2 model val= 3 realTime= true
          
          # here were are done -- ie the doc should be in 1320156373429
          14928 VERBOSE:WRITER7: reopen result DirectoryReader(segments_6:1320156373429:nrt _y(4.0):C78/18 _n(4.0):c2/1 _x(4.0):c11/1 _w(4.0):c1 _z(4.0):C7 _10(4.0):c5 _11(4.0):c2)
          

          as I said - we can not rely on the order these logs are printed - this doesn't seem like a bug in lucene no?

          simon

          Show
          Simon Willnauer added a comment - #those two happen at the same time roughly 14728 VERBOSE:WRITER8: start reopen from Searcher@28294f62 main{DirectoryReader(segments_6:1320156373425:nrt _y(4.0):C78/15 _n(4.0):c2/1 _x(4.0):c11/1 _w(4.0):c1 _z(4.0):C7)} writer= org.apache.solr.update.SolrIndexWriter@43a4181c 14729 VERBOSE:WRITER8: reopen result ( null ) 14733 VERBOSE:WRITER7: start reopen from Searcher@79b0d33c main{DirectoryReader(segments_6:1320156373425:nrt _y(4.0):C78/15 _n(4.0):c2/1 _x(4.0):c11/1 _w(4.0):c1 _z(4.0):C7)} writer= org.apache.solr.update.SolrIndexWriter@43a4181c 14920 VERBOSE:READER2: RealTimeGet using searcher Searcher@79b0d33c main{DirectoryReader(segments_6:1320156373425:nrt _y(4.0):C78/15 _n(4.0):c2/1 _x(4.0):c11/1 _w(4.0):c1 _z(4.0):C7)} 14922 VERBOSE:READER2: ERROR, id= 98 foundVal= 2 model val= 3 realTime= true # here were are done -- ie the doc should be in 1320156373429 14928 VERBOSE:WRITER7: reopen result DirectoryReader(segments_6:1320156373429:nrt _y(4.0):C78/18 _n(4.0):c2/1 _x(4.0):c11/1 _w(4.0):c1 _z(4.0):C7 _10(4.0):c5 _11(4.0):c2) as I said - we can not rely on the order these logs are printed - this doesn't seem like a bug in lucene no? simon
          Hide
          Yonik Seeley added a comment -

          as I said - we can not rely on the order these logs are printed

          I think we can rely on that order, and have seen nothing to indicate otherwise.

          #those two happen at the same time roughly

          Right, hence my comment "The update to id:98 may or may not be in the resulting reader above"
          But then after the next openIfChanged() should definitely reflect the update.

          this doesn't seem like a bug in lucene no?

          It absolutely looks like a lucene bug.
          But given that it seems to only happen with a mix of hard and soft commits, it should be easier to track down. First guess: maybe DirectoryReader.applyDeletes is false when it should be true.

          Show
          Yonik Seeley added a comment - as I said - we can not rely on the order these logs are printed I think we can rely on that order, and have seen nothing to indicate otherwise. #those two happen at the same time roughly Right, hence my comment "The update to id:98 may or may not be in the resulting reader above" But then after the next openIfChanged() should definitely reflect the update. this doesn't seem like a bug in lucene no? It absolutely looks like a lucene bug. But given that it seems to only happen with a mix of hard and soft commits, it should be easier to track down. First guess: maybe DirectoryReader.applyDeletes is false when it should be true.
          Hide
          Simon Willnauer added a comment -

          I opened LUCENE-3551 for yet another issue in IW#nrtIsCurrent. I ran TestRealtimeGet over 1k times (trunk settings) without a failure usually this doesn't survive more than 100. yonik can you try to confirm?

          Show
          Simon Willnauer added a comment - I opened LUCENE-3551 for yet another issue in IW#nrtIsCurrent. I ran TestRealtimeGet over 1k times (trunk settings) without a failure usually this doesn't survive more than 100. yonik can you try to confirm?
          Hide
          Yonik Seeley added a comment -

          I just got a failure, but I had turned off verbose and cranked up the iterations to try and detect failures faster.
          I'll try and reproduce with verbose on again.

          Show
          Yonik Seeley added a comment - I just got a failure, but I had turned off verbose and cranked up the iterations to try and detect failures faster. I'll try and reproduce with verbose on again.
          Hide
          Yonik Seeley added a comment -

          OK, here's the latest failure.
          I verified that I'm up-to-date on trunk (as of 11am EST) and that the patch from LUCENE-3551 is applied.

          12125  VERBOSE:WRITER3: updateDocument id:12 Document<stored,indexed,indexOptions=DOCS_ONLY<id:12> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<val_l:63>> org.apache.solr.update.SolrIndexWriter@62d337d3
          12126  VERBOSE:WRITER3: updateDocument id:12 DONE
          
          12920  VERBOSE:WRITER9: start reopen without writer, reader= DirectoryReader(segments_6:1320256147506:nrt _1t(4.0):C21/14 _1u(4.0):C14/11)
          12984  VERBOSE:WRITER9: reopen result DirectoryReader(segments_6:1320256147507:nrt _1t(4.0):C21/14 _1u(4.0):C14/11)
          
          13042  VERBOSE:WRITER7: deleteDocuments id:12 org.apache.solr.update.SolrIndexWriter@62d337d3
          13043  VERBOSE:WRITER7: deleteDocuments id:12 DONE
          
          13249  VERBOSE:WRITER3: start reopen from Searcher@56e20ef8 main{DirectoryReader(segments_6:1320256147507:nrt _1t(4.0):C21/14 _1u(4.0):C14/11)} writer= org.apache.solr.update.SolrIndexWriter@62d337d3
          13250  VERBOSE:WRITER3: reopen result (null)
          
          13255  VERBOSE:READER3: RealTimeGet using searcher  Searcher@1344e858 main{DirectoryReader(segments_6:1320256147507:nrt _1t(4.0):C21/14 _1u(4.0):C14/11)}
          13256  VERBOSE:READER3: ERROR, id= 12 foundVal= 63 model val= -64 realTime= true
          

          I just realized that since this seems to be happening with a mix of soft and hard commits, it's probably a thread safety issue when updates concurrently happen with a commit and I should add prints before/after calling commit also.

          Show
          Yonik Seeley added a comment - OK, here's the latest failure. I verified that I'm up-to-date on trunk (as of 11am EST) and that the patch from LUCENE-3551 is applied. 12125 VERBOSE:WRITER3: updateDocument id:12 Document<stored,indexed,indexOptions=DOCS_ONLY<id:12> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<val_l:63>> org.apache.solr.update.SolrIndexWriter@62d337d3 12126 VERBOSE:WRITER3: updateDocument id:12 DONE 12920 VERBOSE:WRITER9: start reopen without writer, reader= DirectoryReader(segments_6:1320256147506:nrt _1t(4.0):C21/14 _1u(4.0):C14/11) 12984 VERBOSE:WRITER9: reopen result DirectoryReader(segments_6:1320256147507:nrt _1t(4.0):C21/14 _1u(4.0):C14/11) 13042 VERBOSE:WRITER7: deleteDocuments id:12 org.apache.solr.update.SolrIndexWriter@62d337d3 13043 VERBOSE:WRITER7: deleteDocuments id:12 DONE 13249 VERBOSE:WRITER3: start reopen from Searcher@56e20ef8 main{DirectoryReader(segments_6:1320256147507:nrt _1t(4.0):C21/14 _1u(4.0):C14/11)} writer= org.apache.solr.update.SolrIndexWriter@62d337d3 13250 VERBOSE:WRITER3: reopen result ( null ) 13255 VERBOSE:READER3: RealTimeGet using searcher Searcher@1344e858 main{DirectoryReader(segments_6:1320256147507:nrt _1t(4.0):C21/14 _1u(4.0):C14/11)} 13256 VERBOSE:READER3: ERROR, id= 12 foundVal= 63 model val= -64 realTime= true I just realized that since this seems to be happening with a mix of soft and hard commits, it's probably a thread safety issue when updates concurrently happen with a commit and I should add prints before/after calling commit also.
          Hide
          Yonik Seeley added a comment -

          Here's a failure with prints for the commit and with info stream on.

          Key debug outout:

          27273  VERBOSE:WRITER1: start reopen from Searcher@71412b61 main{DirectoryReader(segments_5:1320282632962:nrt _z(4.0):C21/17 _10(4.0):C3/2 _11(4.0):C6/5 _14(4.0):C8/3 _13(4.0):C4/1 _15(4.0):C1 _16(4.0):C5/1 _18(4.0):C5)} writer= org.apache.solr.update.SolrIndexWriter@1badd463
          27321  VERBOSE:WRITER9: deleteDocuments id:14 org.apache.solr.update.SolrIndexWriter@1badd463
          27895  VERBOSE:WRITER9: deleteDocuments id:14 DONE
          28120  VERBOSE:WRITER1: reopen result DirectoryReader(segments_5:1320282632968:nrt _17(4.0):C19/7 _16(4.0):C5/3 _18(4.0):C5/2 _1a(4.0):C6/1 _19(4.0):C2)
          
          28124  VERBOSE:WRITER6: writer.commit() start writer= org.apache.solr.update.SolrIndexWriter@1badd463
          
          28135  VERBOSE:WRITER4: start reopen from Searcher@c16c2c0 main{DirectoryReader(segments_5:1320282632968:nrt _17(4.0):C19/7 _16(4.0):C5/3 _18(4.0):C5/2 _1a(4.0):C6/1 _19(4.0):C2)} writer= org.apache.solr.update.SolrIndexWriter@1badd463
          28136  VERBOSE:WRITER4: reopen result (null)
          
          28159  VERBOSE:READER2: RealTimeGet using searcher  Searcher@962e703 main{DirectoryReader(segments_5:1320282632968:nrt _17(4.0):C19/7 _16(4.0):C5/3 _18(4.0):C5/2 _1a(4.0):C6/1 _19(4.0):C2)}
          28226  VERBOSE:READER2: ERROR, id= 14 foundVal= 17 model val= -18 realTime= true
          
          28355  VERBOSE:WRITER6: writer.commit() end
          

          For this one, it looks like the reopen was concurrent with the delete, and another reopen was concurrent with the commit. Given that we seem to be only seeing these issues now with a mix of hard & soft commits, I'd guess that the issue is when the commit overlaps with a reopen.

          Show
          Yonik Seeley added a comment - Here's a failure with prints for the commit and with info stream on. Key debug outout: 27273 VERBOSE:WRITER1: start reopen from Searcher@71412b61 main{DirectoryReader(segments_5:1320282632962:nrt _z(4.0):C21/17 _10(4.0):C3/2 _11(4.0):C6/5 _14(4.0):C8/3 _13(4.0):C4/1 _15(4.0):C1 _16(4.0):C5/1 _18(4.0):C5)} writer= org.apache.solr.update.SolrIndexWriter@1badd463 27321 VERBOSE:WRITER9: deleteDocuments id:14 org.apache.solr.update.SolrIndexWriter@1badd463 27895 VERBOSE:WRITER9: deleteDocuments id:14 DONE 28120 VERBOSE:WRITER1: reopen result DirectoryReader(segments_5:1320282632968:nrt _17(4.0):C19/7 _16(4.0):C5/3 _18(4.0):C5/2 _1a(4.0):C6/1 _19(4.0):C2) 28124 VERBOSE:WRITER6: writer.commit() start writer= org.apache.solr.update.SolrIndexWriter@1badd463 28135 VERBOSE:WRITER4: start reopen from Searcher@c16c2c0 main{DirectoryReader(segments_5:1320282632968:nrt _17(4.0):C19/7 _16(4.0):C5/3 _18(4.0):C5/2 _1a(4.0):C6/1 _19(4.0):C2)} writer= org.apache.solr.update.SolrIndexWriter@1badd463 28136 VERBOSE:WRITER4: reopen result ( null ) 28159 VERBOSE:READER2: RealTimeGet using searcher Searcher@962e703 main{DirectoryReader(segments_5:1320282632968:nrt _17(4.0):C19/7 _16(4.0):C5/3 _18(4.0):C5/2 _1a(4.0):C6/1 _19(4.0):C2)} 28226 VERBOSE:READER2: ERROR, id= 14 foundVal= 17 model val= -18 realTime= true 28355 VERBOSE:WRITER6: writer.commit() end For this one, it looks like the reopen was concurrent with the delete, and another reopen was concurrent with the commit. Given that we seem to be only seeing these issues now with a mix of hard & soft commits, I'd guess that the issue is when the commit overlaps with a reopen.
          Hide
          Simon Willnauer added a comment -

          Yonik, I isolated a testcase which fails very quickly. See LUCENE-3551

          Show
          Simon Willnauer added a comment - Yonik, I isolated a testcase which fails very quickly. See LUCENE-3551
          Hide
          Yonik Seeley added a comment -

          Great! Trying it out now...

          Show
          Yonik Seeley added a comment - Great! Trying it out now...
          Hide
          Yonik Seeley added a comment -

          OK, 24 hours on my box w/o a fail!

          Show
          Yonik Seeley added a comment - OK, 24 hours on my box w/o a fail!
          Hide
          Simon Willnauer added a comment -

          OK, 24 hours on my box w/o a fail!

          awesome! I am going to commit LUCENE-3551 soon

          Show
          Simon Willnauer added a comment - OK, 24 hours on my box w/o a fail! awesome! I am going to commit LUCENE-3551 soon
          Hide
          Simon Willnauer added a comment -

          this has been fixed in LUCENE-3551

          Show
          Simon Willnauer added a comment - this has been fixed in LUCENE-3551

            People

            • Assignee:
              Simon Willnauer
              Reporter:
              Yonik Seeley
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development