Solr
  1. Solr
  2. SOLR-2565

Prevent IW#close and cut over to IW#commit

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: update
    • Labels:
      None

      Description

      Spinnoff from SOLR-2193. We already have a branch to work on this issue here https://svn.apache.org/repos/asf/lucene/dev/branches/solr2193

      The main goal here is to prevent solr from closing the IW and use IW#commit instead. AFAIK the main issues here are:

      The update handler needs an overhaul.

      A few goals I think we might want to look at:

      1. Expose the SolrIndexWriter in the api or add the proper abstractions to get done what we now do with special casing:
      2. Stop closing the IndexWriter and start using commit (still lazy IW init though).
      3. Drop iwAccess, iwCommit locks and sync mostly at the Lucene level.
      4. Address the current issues we face because multiple original/'reloaded' cores can have a different IndexWriter on the same index.

      Eventually this is a preparation for NRT support in Solr which I will create a followup issue for.

      1. SOLR-2565-revert.patch
        5 kB
        Uwe Schindler
      2. SOLR-2565.patch
        154 kB
        Robert Muir
      3. SOLR-2565.patch
        2 kB
        Yonik Seeley
      4. SOLR-2565.patch
        9 kB
        Mark Miller
      5. SOLR-2565__HuperDuperAutoCommitTest.patch
        19 kB
        Hoss Man
      6. slowtests.txt
        53 kB
        Robert Muir
      7. fix+hossmans-test.patch
        24 kB
        Mark Miller
      8. dump.txt
        9 kB
        Uwe Schindler

        Issue Links

          Activity

          Hide
          Simon Willnauer added a comment -

          we should improve the tests for file handles etc. in solr to make sure all file handles are released properly. This should give more confidence to commit large refactorings in solr

          Show
          Simon Willnauer added a comment - we should improve the tests for file handles etc. in solr to make sure all file handles are released properly. This should give more confidence to commit large refactorings in solr
          Hide
          Robert Muir added a comment -

          I merged the initial commit from SOLR-2279 (which enables MockDirectoryWrapper for solr tests/tracking file handles)... no issues found from the branch.

          I'm gonna look now at some other general ways we can beef up the test coverage in Solr, especially by reusing what we have done in Lucene.

          Show
          Robert Muir added a comment - I merged the initial commit from SOLR-2279 (which enables MockDirectoryWrapper for solr tests/tracking file handles)... no issues found from the branch. I'm gonna look now at some other general ways we can beef up the test coverage in Solr, especially by reusing what we have done in Lucene.
          Hide
          Robert Muir added a comment -

          As far as tests go on this one, personally I am happy.

          It would be good if people could review the branch (its really not a lot of code), and maybe give some opinions on the TODOs such as indexwriter setting changes on core reloads, etc.

          I think it would be better to merge this into trunk sooner than later and iterate in trunk myself.

          Show
          Robert Muir added a comment - As far as tests go on this one, personally I am happy. It would be good if people could review the branch (its really not a lot of code), and maybe give some opinions on the TODOs such as indexwriter setting changes on core reloads, etc. I think it would be better to merge this into trunk sooner than later and iterate in trunk myself.
          Hide
          Robert Muir added a comment -

          here is a patch, showing all changes from the throwaway branch.

          I'm going on vacation for a week, i hope this thing is committed... if not I may just commit it myself the day I come back.

          Show
          Robert Muir added a comment - here is a patch, showing all changes from the throwaway branch. I'm going on vacation for a week, i hope this thing is committed... if not I may just commit it myself the day I come back.
          Hide
          Mark Miller added a comment -

          Threats, threats...

          Show
          Mark Miller added a comment - Threats, threats...
          Hide
          Mark Miller added a comment -

          If this patch applies, I'll commit it.

          Show
          Mark Miller added a comment - If this patch applies, I'll commit it.
          Hide
          Mark Miller added a comment -

          Okay - looks like this applies and tests pass.

          Show
          Mark Miller added a comment - Okay - looks like this applies and tests pass.
          Hide
          Mark Miller added a comment -

          Committed - there is still some wiki work to do.

          Show
          Mark Miller added a comment - Committed - there is still some wiki work to do.
          Hide
          Mark Miller added a comment -

          I've still got to put a note in changes about how you should reload SolrCores after this change.

          Show
          Mark Miller added a comment - I've still got to put a note in changes about how you should reload SolrCores after this change.
          Hide
          Jason Rutherglen added a comment -

          This issue says committed in the comments, however it's status is: "Unresolved"?

          Show
          Jason Rutherglen added a comment - This issue says committed in the comments, however it's status is: "Unresolved"?
          Hide
          Mark Miller added a comment -

          Yeah, sorry - it's open as a reminder for me to make that changes note (or at least evaluate if something should be done) and do the wiki documentation. I'll try and do that tomorrow if I can and get this closed.

          Show
          Mark Miller added a comment - Yeah, sorry - it's open as a reminder for me to make that changes note (or at least evaluate if something should be done) and do the wiki documentation. I'll try and do that tomorrow if I can and get this closed.
          Hide
          Yonik Seeley added a comment -

          I think the removal of all of the synchronization in the update handler leads to a race condition on reopen.

          Basically, the order that we start a reopen in different threads does not have to correspond to the order that the reopen finishes, or the order that the new searcher is registered. This can mean temporarily lost documents - a request can add a document + repoen and due to scheduling have an older reopen registered last in a different thread (meaning the document won't be visible).

          So it looks like we need to synchronize to ensure that only one thread reopens at a time.

          As far as I can tell, I don't think that this synchronization needs to extend to the IW.commit() for a hard commit.

          Show
          Yonik Seeley added a comment - I think the removal of all of the synchronization in the update handler leads to a race condition on reopen. Basically, the order that we start a reopen in different threads does not have to correspond to the order that the reopen finishes, or the order that the new searcher is registered. This can mean temporarily lost documents - a request can add a document + repoen and due to scheduling have an older reopen registered last in a different thread (meaning the document won't be visible). So it looks like we need to synchronize to ensure that only one thread reopens at a time. As far as I can tell, I don't think that this synchronization needs to extend to the IW.commit() for a hard commit.
          Hide
          Yonik Seeley added a comment -

          My TestRealTimeGet currently fails after a few minutes... it may be due to this (hopefully). I'll try fixing.

          Show
          Yonik Seeley added a comment - My TestRealTimeGet currently fails after a few minutes... it may be due to this (hopefully). I'll try fixing.
          Hide
          Yonik Seeley added a comment -

          With the following patch, TestRealTimeGet ran without errors (I stopped it after it ran for over an hour).

          Show
          Yonik Seeley added a comment - With the following patch, TestRealTimeGet ran without errors (I stopped it after it ran for over an hour).
          Hide
          Mark Miller added a comment -

          +1 - makes sense, looks right.

          Without the patch, your stress test fails for me on my new linux box in 6-12 seconds regularly. With the patch I just ran it for half an hour.

          Show
          Mark Miller added a comment - +1 - makes sense, looks right. Without the patch, your stress test fails for me on my new linux box in 6-12 seconds regularly. With the patch I just ran it for half an hour.
          Hide
          Vadim Kisselmann added a comment - - edited

          I tested the nightly build #1595 with an new patch (2565), but NRT doesn't work in my case.

          I index 10 docs/sec, it takes 1-30sec. to see the results.
          same behavior when i update an existing document.

          My addedDate is an timestamp (default="NOW"). In worst case i can see what the document which i indexed is already more when 30
          seconds in my index, but i can't see it.

          My Settings:
          <autoCommit>
          <maxDocs>1000</maxDocs>
          <maxTime>60000</maxTime>
          </autoCommit>

          <autoSoftCommit>
          <maxDocs>1</maxDocs>
          <maxTime>1000</maxTime>
          </autoSoftCommit>

          Are my settings wrong or need you more details?
          Should i use the coldSearcher (default=false)? Or set maxWarmingSearchers higher than 2?

          UPDATE:
          If i only use autoSoftCommit and uncomment autoCommit it works.
          But i should use the "hard" autoCommit, right?
          Mark said yes:
          http://www.lucidimagination.com/blog/2011/07/11/benchmarking-the-new-solr-‘near-realtime’-improvements/

          Show
          Vadim Kisselmann added a comment - - edited I tested the nightly build #1595 with an new patch (2565), but NRT doesn't work in my case. I index 10 docs/sec, it takes 1-30sec. to see the results. same behavior when i update an existing document. My addedDate is an timestamp (default="NOW"). In worst case i can see what the document which i indexed is already more when 30 seconds in my index, but i can't see it. My Settings: <autoCommit> <maxDocs>1000</maxDocs> <maxTime>60000</maxTime> </autoCommit> <autoSoftCommit> <maxDocs>1</maxDocs> <maxTime>1000</maxTime> </autoSoftCommit> Are my settings wrong or need you more details? Should i use the coldSearcher (default=false)? Or set maxWarmingSearchers higher than 2? UPDATE: If i only use autoSoftCommit and uncomment autoCommit it works. But i should use the "hard" autoCommit, right? Mark said yes: http://www.lucidimagination.com/blog/2011/07/11/benchmarking-the-new-solr- ‘near-realtime’-improvements/
          Hide
          Mark Miller added a comment -

          Thanks Vadim - I'll look at this tomorrow.

          Show
          Mark Miller added a comment - Thanks Vadim - I'll look at this tomorrow.
          Hide
          Mark Miller added a comment -

          Checked it real quick before I head out - it's a mistake in trying to not hard and soft commit at the same time. It's just not correct.

          I have a test case and fix - I'll post a patch tomorrow.

          Show
          Mark Miller added a comment - Checked it real quick before I head out - it's a mistake in trying to not hard and soft commit at the same time. It's just not correct. I have a test case and fix - I'll post a patch tomorrow.
          Hide
          Mark Miller added a comment -

          here is the fix and a test, I'll commit soon.

          Show
          Mark Miller added a comment - here is the fix and a test, I'll commit soon.
          Hide
          Vadim Kisselmann added a comment -

          Hi Mark,
          thanks for the quick reply and fix.
          I test it tomorrow at work and report.
          Regards Vadim

          Show
          Vadim Kisselmann added a comment - Hi Mark, thanks for the quick reply and fix. I test it tomorrow at work and report. Regards Vadim
          Hide
          Vadim Kisselmann added a comment -

          Hi Mark,
          it works, thanks!
          The docs are equally available.

          Now i'm fighting with "Internal Server Errors" caused by such queries:
          http://localhost:8080/solr/clustering?wt=ruby&rows=40&start=0&fl=description+pubDate&sort=pubDate+desc&q=london
          It's this Issue: https://issues.apache.org/jira/browse/LUCENE-2208?focusedCommentId=13082315#comment-13082315

          Show
          Vadim Kisselmann added a comment - Hi Mark, it works, thanks! The docs are equally available. Now i'm fighting with "Internal Server Errors" caused by such queries: http://localhost:8080/solr/clustering?wt=ruby&rows=40&start=0&fl=description+pubDate&sort=pubDate+desc&q=london It's this Issue: https://issues.apache.org/jira/browse/LUCENE-2208?focusedCommentId=13082315#comment-13082315
          Hide
          Jason Rutherglen added a comment -

          Can this one be backported to 3.x? It would probably be fairly useful for people to use now?

          Show
          Jason Rutherglen added a comment - Can this one be backported to 3.x? It would probably be fairly useful for people to use now?
          Hide
          Michael McCandless added a comment -

          +1 to backport; this is a serious limitation in Solr.

          ElasticSearch has been using this approach (NRT & commit, not IW.close) for quite some time already.

          Show
          Michael McCandless added a comment - +1 to backport; this is a serious limitation in Solr. ElasticSearch has been using this approach (NRT & commit, not IW.close) for quite some time already.
          Hide
          Robert Muir added a comment -

          thread dumps of slow tests

          Show
          Robert Muir added a comment - thread dumps of slow tests
          Hide
          Uwe Schindler added a comment -

          I get the same thread dumps, sometimes also with a MockDirectoryFactory.maybeYield (seldom), and this one.

          Show
          Uwe Schindler added a comment - I get the same thread dumps, sometimes also with a MockDirectoryFactory.maybeYield (seldom), and this one.
          Hide
          Uwe Schindler added a comment -

          This is the "minimal revert". With this patch applied, the SolrJ tests are running fast again.

          Show
          Uwe Schindler added a comment - This is the "minimal revert". With this patch applied, the SolrJ tests are running fast again.
          Show
          Uwe Schindler added a comment - This is the thread about the slowness: http://www.lucidimagination.com/search/document/a7dcdd77c8b0d14d/solr_tests_in_trunk_sloooooooooooooooooooooooooow
          Hide
          Uwe Schindler added a comment -

          I committed the partial revert in revision: 1159448

          Show
          Uwe Schindler added a comment - I committed the partial revert in revision: 1159448
          Hide
          Hoss Man added a comment -

          Summary of IRC discussion (as i remember it) that didn't make it to the mailing list...

          • Use and rmuir found that they could consistently reproduce "slow" and "fast" behavior in some tests using fixed seeds...
            • ant test -Dtestcase=SolrExampleBinaryTest -Dtests.seed=1:1:1 ... always slow
            • ant test -Dtestcase=SolrExampleBinaryTest -Dtests.seed=2:2:2 ... always fast
          • for Uwe and sarowe (on windows), slow ment "taking so long they finally just had to give up and kill the JVM"
          • For rmuir on his linux box, slow vs. fast was ~1-10 seconds vs 100 seconds
          • For hossman on my linux box, results were roughly consistent with rmuir, but slightly slower since i have a crappier box
          • with the path Uwe commited in r1159448 rmuir, uwe, and sarowe all agreed the tests were "fast" again
          • testUnicode (where the majority of the slowness is seen) is one of the few test methods in these slow classes that uses random testing in a loop, so the compounded speed issues there are likely from some cumulative slow down that is typically only done rarely in other tests (likely "commit" given the nature of the commit that seemed to cause the problem)
          • Jenkin's is using a test multiplier of "3" which is specifically used in testUnicode to pick the number of iterations, so that obviously compounds the effects of this slowdown in jenkins – but makes me highly suspicious about why Uwe and sarowe were seeing so much longer test times on their windows machines then rmuir or i were seeing on linux (since they weren't using any special test multiplier)
          Show
          Hoss Man added a comment - Summary of IRC discussion (as i remember it) that didn't make it to the mailing list... Use and rmuir found that they could consistently reproduce "slow" and "fast" behavior in some tests using fixed seeds... ant test -Dtestcase=SolrExampleBinaryTest -Dtests.seed=1:1:1 ... always slow ant test -Dtestcase=SolrExampleBinaryTest -Dtests.seed=2:2:2 ... always fast for Uwe and sarowe (on windows), slow ment "taking so long they finally just had to give up and kill the JVM" For rmuir on his linux box, slow vs. fast was ~1-10 seconds vs 100 seconds For hossman on my linux box, results were roughly consistent with rmuir, but slightly slower since i have a crappier box with the path Uwe commited in r1159448 rmuir, uwe, and sarowe all agreed the tests were "fast" again testUnicode (where the majority of the slowness is seen) is one of the few test methods in these slow classes that uses random testing in a loop, so the compounded speed issues there are likely from some cumulative slow down that is typically only done rarely in other tests (likely "commit" given the nature of the commit that seemed to cause the problem) Jenkin's is using a test multiplier of "3" which is specifically used in testUnicode to pick the number of iterations, so that obviously compounds the effects of this slowdown in jenkins – but makes me highly suspicious about why Uwe and sarowe were seeing so much longer test times on their windows machines then rmuir or i were seeing on linux (since they weren't using any special test multiplier)
          Hide
          Mark Miller added a comment -

          Hossman has been working on a new test and it has picked up a further issue along the lines of the one Vadim brought up - if using time based auto commit for both hard and soft commits, the soft commits won't happen when triggered by deletes.

          Show
          Mark Miller added a comment - Hossman has been working on a new test and it has picked up a further issue along the lines of the one Vadim brought up - if using time based auto commit for both hard and soft commits, the soft commits won't happen when triggered by deletes.
          Hide
          Hoss Man added a comment -

          As mark mentioned, i started looking into a better overall meme for testing autocommit – i could never reproduce any of the failures charlie cron was complaining about, but the tests didn't make any sense to me anyway.

          path demonstrates a new overall approach, using more detailed monitor of events – not just "the most recent event" but all of them in a queue of (rough) timestamps.

          at the moment one of the tests has an @Ignore:nocommit because of the delete bug that miller mentioned, but it would be helpful to know if people who were seeing problems running AutoCOmmitTest.testSoftAndHardCommitMaxTime (ie: charlie cron and simon) could try this patch out and see if it works better for them.

          Show
          Hoss Man added a comment - As mark mentioned, i started looking into a better overall meme for testing autocommit – i could never reproduce any of the failures charlie cron was complaining about, but the tests didn't make any sense to me anyway. path demonstrates a new overall approach, using more detailed monitor of events – not just "the most recent event" but all of them in a queue of (rough) timestamps. at the moment one of the tests has an @Ignore:nocommit because of the delete bug that miller mentioned, but it would be helpful to know if people who were seeing problems running AutoCOmmitTest.testSoftAndHardCommitMaxTime (ie: charlie cron and simon) could try this patch out and see if it works better for them.
          Hide
          Robert Muir added a comment -

          Looks cool, I'm beasting this test on a couple of machines now...

          Show
          Robert Muir added a comment - Looks cool, I'm beasting this test on a couple of machines now...
          Hide
          Mark Miller added a comment -

          a fix for the delete issue and hossmans test with the test method that was ignored not ignored anymore.

          that test now fails for me due to what looks like a timing issue

          Show
          Mark Miller added a comment - a fix for the delete issue and hossmans test with the test method that was ignored not ignored anymore. that test now fails for me due to what looks like a timing issue
          Hide
          Robert Muir added a comment -

          I ran the original patch (with the @ignore still enabled) 100x each on my fast and slow machine.

          Show
          Robert Muir added a comment - I ran the original patch (with the @ignore still enabled) 100x each on my fast and slow machine.
          Hide
          Yonik Seeley added a comment -

          So now that commits no longer block adds, we should revisit what the best defaults are.
          The transaction logging in SOLR-2700 needs to keep track of uncommitted documents - hence if it doesn't affect performance too much, we should probably commit more often some how. Autocommit based on number of documents doesn't work well across the broad spectrum of users (what value would work well for twitter indexers and book indexers). A size-based approach would probably work best, but we don't have that. Maybe a time based approach? That would limit the transaction log size to the number of documents indexable in a given time period, which should be roughly proportional to the document size. I guess such a time period should be somewhere between 10 and 60 seconds? A lot of data can be indexed in 60 sec, and the goal is to limit the transaction log size while not impacting performance too much due to increased commit frequency.

          The other issue is soft commits... should we configure a soft commitWithin by default (prob within the range of 1-10 sec)?

          Show
          Yonik Seeley added a comment - So now that commits no longer block adds, we should revisit what the best defaults are. The transaction logging in SOLR-2700 needs to keep track of uncommitted documents - hence if it doesn't affect performance too much, we should probably commit more often some how. Autocommit based on number of documents doesn't work well across the broad spectrum of users (what value would work well for twitter indexers and book indexers). A size-based approach would probably work best, but we don't have that. Maybe a time based approach? That would limit the transaction log size to the number of documents indexable in a given time period, which should be roughly proportional to the document size. I guess such a time period should be somewhere between 10 and 60 seconds? A lot of data can be indexed in 60 sec, and the goal is to limit the transaction log size while not impacting performance too much due to increased commit frequency. The other issue is soft commits... should we configure a soft commitWithin by default (prob within the range of 1-10 sec)?
          Hide
          Mark Miller added a comment -

          I'd been considering a default auto-soft commit by time.

          If you think every minute is a good hard commit upper bound due to the transaction log, my initial thoughts are:

          Auto commit by time
          hard commit: 60 seconds
          soft commit: 1 second

          Show
          Mark Miller added a comment - I'd been considering a default auto-soft commit by time. If you think every minute is a good hard commit upper bound due to the transaction log, my initial thoughts are: Auto commit by time hard commit: 60 seconds soft commit: 1 second
          Hide
          Yonik Seeley added a comment -

          I did some performance tests w/ autoCommit=60000 (60 sec) vs no commits at all (10M small docs)
          6 runs for each configuration, times in seconds.

          no commits (not even at the end): 202,222,217,206,203,196 average=207
          autoCommit=60000: 201,206,210,226,220,201 average=210

          So committing every 60 seconds actually came out faster (probably due to chance because of the high variability).

          Aside: there could be a race issue w/ autocommit. I was checking the segments_n files to verify that commits were actually taking place periodically. On a new index, the first file is segments_1. All of the tests ran around 3 minutes 30 seconds... which should have meant 4 more commits, or a segments_5 file after 4 minutes. I only checked a few times, but one time I saw a segments_6, and two other times I saw a segments_9 file. This suggests that too many commits are happening for some reason.

          Show
          Yonik Seeley added a comment - I did some performance tests w/ autoCommit=60000 (60 sec) vs no commits at all (10M small docs) 6 runs for each configuration, times in seconds. no commits (not even at the end): 202,222,217,206,203,196 average=207 autoCommit=60000: 201,206,210,226,220,201 average=210 So committing every 60 seconds actually came out faster (probably due to chance because of the high variability). Aside: there could be a race issue w/ autocommit. I was checking the segments_n files to verify that commits were actually taking place periodically. On a new index, the first file is segments_1. All of the tests ran around 3 minutes 30 seconds... which should have meant 4 more commits, or a segments_5 file after 4 minutes. I only checked a few times, but one time I saw a segments_6, and two other times I saw a segments_9 file. This suggests that too many commits are happening for some reason.
          Hide
          Mark Miller added a comment -

          there could be a race issue w/ autocommit

          I wouldn't be surprised - the AutoCommit tracker has never been 100% thread safe code. Though I did make some improvements, I didn't address everything I saw. I think it's generally been benign stuff though, so could be something else too.

          Show
          Mark Miller added a comment - there could be a race issue w/ autocommit I wouldn't be surprised - the AutoCommit tracker has never been 100% thread safe code. Though I did make some improvements, I didn't address everything I saw. I think it's generally been benign stuff though, so could be something else too.
          Hide
          Mark Miller added a comment -

          Further work to be followed up in other issues - I've put in the basic doc on the wiki around this.

          Show
          Mark Miller added a comment - Further work to be followed up in other issues - I've put in the basic doc on the wiki around this.
          Hide
          Mark Miller added a comment -

          that test now fails for me due to what looks like a timing issue

          Two of the tests in hossmans new AutoCommit test where failing for me on my linux box - i tweaked both to pass and also tested them on my mac. I've also reenabled the AutoCommit test class, but removed the tests involving soft commit - they will be tested with this new method in the new test class instead - later we can move the original auto commit test methods to the new style.

          I'll commit soon.

          Show
          Mark Miller added a comment - that test now fails for me due to what looks like a timing issue Two of the tests in hossmans new AutoCommit test where failing for me on my linux box - i tweaked both to pass and also tested them on my mac. I've also reenabled the AutoCommit test class, but removed the tests involving soft commit - they will be tested with this new method in the new test class instead - later we can move the original auto commit test methods to the new style. I'll commit soon.

            People

            • Assignee:
              Mark Miller
              Reporter:
              Simon Willnauer
            • Votes:
              6 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development