Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.5
    • Component/s: None
    • Labels: None

      Description

      autocommit seems to commit more frequently than configured.

      Attachments

    1. SOLR-2748_solr33.patch
        28 kB
        Jan Høydahl
      2. SOLR-2748.patch
        11 kB
        Yonik Seeley
      3. SOLR-2748.patch
        7 kB
        Yonik Seeley

        Activity

        Yonik Seeley added a comment -

        So I think the culprit might be CommitTracker.run()
        At the end (after doing the commit) it has this code:

            // check if docs have been submitted since the commit started
            if (lastAddedTime > started) {
              if (docsUpperBound > 0 && docsSinceCommit.get() > docsUpperBound) {
                pending = scheduler.schedule(this, 100, TimeUnit.MILLISECONDS);
              } else if (timeUpperBound > 0) {
                pending = scheduler.schedule(this, timeUpperBound,
                    TimeUnit.MILLISECONDS);
              }
            }
        

        This seems to blindly schedule another commit (one should already have been scheduled?). So now we have two commits scheduled for the next round where there should have been only one, and those two commits then have the potential to turn into four the round after that, and so on.
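
        A minimal sketch of one way to avoid the double-scheduling (the names below mirror CommitTracker but are illustrative, not the committed fix): clear and re-check "pending" under the lock, and schedule at most one follow-up:

            import java.util.concurrent.Executors;
            import java.util.concurrent.ScheduledExecutorService;
            import java.util.concurrent.ScheduledFuture;
            import java.util.concurrent.TimeUnit;

            class TrackerSketch implements Runnable {
              private final ScheduledExecutorService scheduler =
                  Executors.newScheduledThreadPool(1);
              private ScheduledFuture<?> pending;   // guarded by "this"
              private volatile long lastAddedTime;  // set by add()
              private final long timeUpperBound = 1000; // autocommit maxTime, in ms

              public void run() {
                final long started = System.currentTimeMillis();
                doCommit(); // the actual (possibly slow) commit work

                synchronized (this) {
                  pending = null; // this scheduled run is finished
                  // Schedule at most ONE follow-up, and only if docs arrived
                  // while the commit was running.
                  if (lastAddedTime > started && timeUpperBound > 0) {
                    pending = scheduler.schedule(this, timeUpperBound,
                        TimeUnit.MILLISECONDS);
                  }
                }
              }

              private void doCommit() { /* commit the index */ }
            }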

        Yonik Seeley added a comment - edited

        Hmmm, I'm not quite understanding the logic of this class.
        CommitTracker.didCommit() (which is called by DUH2 after a commit finishes) tries to cancel any pending scheduled operations and resets the doc counter to 0. But that seems like a bug since documents may have been added during the commit, and a new commit may have been scheduled while the old commit was executing. Of course we are going to lose track of that since run() sets pending to null.

        edit: and of course, trying to cancel "pending" from DUH2 like that is another race condition... which "pending" will it cancel? It could be the old one or the new one, depending on thread timing.
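
        As a sketch of the counter problem (docsSinceCommit is an AtomicLong as in the snippet above; the rest is illustrative, not the actual fix): rather than resetting the counter to zero after the commit, subtract only the docs the commit actually covered, so adds that arrive mid-commit still count toward the next autocommit:

            import java.util.concurrent.atomic.AtomicLong;

            class CounterSketch {
              private final AtomicLong docsSinceCommit = new AtomicLong();

              void addDoc() {
                docsSinceCommit.incrementAndGet();
              }

              void commit() {
                // Snapshot what this commit will cover BEFORE doing the work.
                final long covered = docsSinceCommit.get();
                doCommitWork();
                // Subtract only the covered docs; anything added during the
                // commit survives and still counts toward the next autocommit.
                docsSinceCommit.addAndGet(-covered);
              }

              private void doCommitWork() { /* flush and commit the index */ }
            }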

        Yonik Seeley added a comment - edited

        Another interesting thing I ran into while reviewing this code: the CommitTracker.run() method is synchronized, and so is _scheduleCommitWithin(), meaning (I think) that a long-running commit will block anything calling that method (and an autoCommit by time, or an add with a commitWithin specified, would qualify).

        edit: and I seem to recall some people reporting that adds were blocked until a new searcher was registered (but I could never reproduce it myself). This is probably the root cause of those problems.
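
        A sketch of how the lock could be narrowed (illustrative names, not the committed fix): keep only the scheduling state under the monitor and run the slow commit outside it, so _scheduleCommitWithin()-style callers never wait on a commit:

            import java.util.concurrent.Executors;
            import java.util.concurrent.ScheduledExecutorService;
            import java.util.concurrent.ScheduledFuture;
            import java.util.concurrent.TimeUnit;

            class NarrowLockSketch implements Runnable {
              private final ScheduledExecutorService scheduler =
                  Executors.newScheduledThreadPool(1);
              private ScheduledFuture<?> pending; // guarded by "this"

              // Cheap: only touches scheduling state, so it never blocks on a commit.
              synchronized void scheduleCommitWithin(long ms) {
                if (pending == null) {
                  pending = scheduler.schedule(this, ms, TimeUnit.MILLISECONDS);
                }
              }

              public void run() {
                synchronized (this) {
                  pending = null; // update shared state under the lock...
                }
                doCommit(); // ...but do the slow work OUTSIDE it
              }

              private void doCommit() { /* commit the index */ }
            }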

        Yonik Seeley added a comment -

        OK, here's a first cut at a patch trying to fix some of these issues.
        AutoCommitTest.testSoftCommitMaxTime() is failing once in a while though... not sure what's up yet.

        Yonik Seeley added a comment -

        After adding a bunch of prints, I think this is a test bug.
        Just because a newSearcher callback has been issued (triggered) does not mean that a new searcher has been registered yet.
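
        A sketch of what a race-free wait could look like in a test (the registration hook here is a hypothetical stand-in, not Solr's actual API): block on a latch that is released only once the new searcher is actually registered, not when the newSearcher event fires:

            import java.util.concurrent.CountDownLatch;
            import java.util.concurrent.TimeUnit;

            class SearcherWaitSketch {
              private final CountDownLatch registered = new CountDownLatch(1);

              // Hypothetical hook: called only after the new searcher has been
              // registered, not when the newSearcher callback is first issued.
              void onSearcherRegistered() {
                registered.countDown();
              }

              // Test side: wait for actual registration (with a timeout)
              // before asserting anything about the new searcher.
              boolean awaitNewSearcher(long timeoutMs) throws InterruptedException {
                return registered.await(timeoutMs, TimeUnit.MILLISECONDS);
              }
            }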

        Jason Rutherglen added a comment -

        Seeing all of the bugs related to the Solr NRT code, I can't help but wonder why the 4.x version of the project needs to be backward compatible.

        Also, why isn't it using IndexReaderWarmer, which was ostensibly created precisely for Solr's usage? (It's not used in Solr, and never has been.)

        Yonik Seeley added a comment -

        "Seeing all of the bugs related to the Solr NRT code, I can't help but wonder why the 4.x version of the project needs to be backward compatible."

        It's not really related to NRT since the autoCommit (CommitTracker) code has been around for a very long time (way before NRT).

        Yonik Seeley added a comment -

        Found another race in AutoCommitTest that was causing a failure... the autoCommitCount is incremented after the commit returns, so depending on thread scheduling, the test can be triggered to continue after the commit, but before autoCommitCount is incremented.
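
        A minimal sketch of the ordering fix (which the next patch applies; names here are illustrative): bump the counter before the commit, so any thread released by the commit finishing already sees the new count:

            import java.util.concurrent.atomic.AtomicInteger;

            class CountOrderSketch {
              private final AtomicInteger autoCommitCount = new AtomicInteger();

              void autoCommit() {
                // Increment BEFORE committing: a test thread that is woken up
                // by the commit finishing can never observe a stale count.
                autoCommitCount.incrementAndGet();
                doCommit();
              }

              private void doCommit() { /* commit the index */ }
            }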

        Yonik Seeley added a comment -

        OK, here's a patch (with some extra logging still in) that fixes some of the bugs in AutoCommitTest, and also moves the autoCommitCount increment up, to before the commit happens.

        With this patch, things seem much better... but I still got 2 failures in AutoCommitTest out of 45 runs. Next I plan to repeat some of my performance tests where I was seeing way too many commits for what was configured.

        Hoss also pointed out "see SOLR-2565 and the HuperDuperAutoCommitTest for an in progress attempt at making it work better"

        Yonik Seeley added a comment -

        My 10M doc performance testing now shows the correct number of commits.
        I also tried turning on soft autocommits at 1 sec, and that also resulted in the correct number of soft commits being done.

        Oddly enough, the autoCommit + softAutoCommit test ran in 2:35, while the autoCommit-only test ran in 3:33.
        One explanation could be that DWPT doesn't necessarily seem optimal for older (non-SSD) drives (Erick reported seeing trunk as slower than 3x on his system with a spinning-magnets type drive), and the smaller segments avoided some of this.
        The other explanation (and this one actually makes more sense to me) is that the CSV loader used is single-threaded. Adding the first 1000 documents to a small segment is probably more efficient than adding the last 1000 to a larger segment. Doing more soft commits means creating smaller segments and doing more work in background merging using other CPU cores (basically, it increased the parallelism).

        Yonik Seeley added a comment -

        I re-ran the 10M doc test with soft autocommit set to 10ms (obviously too low, but I just wanted to make sure that things didn't blow up). Things went fine, no exceptions, etc., and it managed to commit at a rate of 73 commits/sec while indexing. The rate should be even higher with logging turned off.

        Yonik Seeley added a comment -

        I've been through a few hundred iterations of AutoCommitTest with no failures (with logging turned on).
        I think it's time to clean up the debugging logs and commit!

        Ryan McKinley added a comment -

        Thanks for looking into this! I vaguely recall stumbling through this with Mike Klaas 5 years ago!

        +1

        Yonik Seeley added a comment -

        Committed. Will backport to 3x next.

        Jan Høydahl added a comment -

        Needed to apply this for a client on 3.3, so here's the patch.

        Cassandra Targett added a comment -

        It appears this was fixed long, long ago but the issue was simply never closed at the time. I see an entry in CHANGES.txt for 3.5, and there is this commit in the history: https://git1-us-west.apache.org/repos/asf?p=lucene-solr.git;a=commit;h=013e2776b867cc7f087ec17207a2720f2806689e.

        If it should still be open, my mistake, please reopen.


          People

          • Assignee: Yonik Seeley
          • Reporter: Yonik Seeley
          • Votes: 0
          • Watchers: 1
