Solr
  SOLR-65

Multithreaded DirectUpdateHandler2

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1.0
    • Component/s: update
    • Labels:
      None

      Description

      Basic implementation of autoCommit functionality, plus an overhaul of DUH2 threading to reduce lock contention

      1. autocommit_patch.diff
        36 kB
        Mike Klaas
      2. autocommit_patch.diff
        33 kB
        Mike Klaas
      3. autocommit_patch.diff
        31 kB
        Mike Klaas
      4. autocommit_patch.diff
        24 kB
        Mike Klaas

        Issue Links

          Activity

          Mike Klaas added a comment -

          Basic autocommit implementation.

          Not complete: this patch still requires a scheduled call to checkCommit() to have the correct semantics--it is something I'm working on.

          I posted this patch to get feedback on the thread synchronization strategy. I eliminated all synchronized blocks, and instead am using a reader/writer lock from java.util.concurrent for synchronization. Virtually everything is serialized using the writer lock (like the old strategy), but document adding downgrades its lock to a lower-contention reader lock before adding the document to the index.
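          The downgrade strategy described above can be sketched with java.util.concurrent's ReentrantReadWriteLock (class and field names here are illustrative, not the actual DUH2 code):

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch of the lock-downgrading strategy described above;
// names are hypothetical, not the patch's actual code.
class UpdateHandlerSketch {
    private final ReentrantReadWriteLock iwCommit = new ReentrantReadWriteLock();
    final AtomicLong numDocsPending = new AtomicLong();

    // Commits (and other "serialized" operations) take the exclusive write lock.
    void commit() {
        iwCommit.writeLock().lock();
        try {
            // flush the IndexWriter, open a new searcher, etc.
            numDocsPending.set(0);
        } finally {
            iwCommit.writeLock().unlock();
        }
    }

    // addDoc starts under the write lock, then downgrades to the shared read
    // lock before the (comparatively slow) index write, so other adds can
    // overlap while a commit is still excluded.
    void addDoc(String docId) {
        iwCommit.writeLock().lock();
        try {
            // bookkeeping that must not race with a commit
            numDocsPending.incrementAndGet();
            iwCommit.readLock().lock(); // acquire read lock *before* releasing write
        } finally {
            iwCommit.writeLock().unlock(); // downgrade complete
        }
        try {
            // index the document under the lower-contention read lock
        } finally {
            iwCommit.readLock().unlock();
        }
    }
}
```

          (A later comment in this thread notes that this exact scheme still serializes adds, since every addDoc briefly holds the write lock; the sketch only shows the downgrade mechanics.)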

          Yonik Seeley added a comment -

          The multithreaded stuff looks very nice!

          I like the idea of separating out the autoCommit functionality into a tracker.

          I notice that in closeWriter() you call tracker.didCommit()... but I think of a commit more as making sure everything is on stable storage, and making those changes visible. closeWriter() is also currently called as an implementation side effect of delete-by-query.

          There are some different uses for commit control that I can think of:
          1) users don't want to manually send commits, and want to ensure that any changes made to the index will be visible within a certain amount of time.
          2) in the event of multiple updaters all calling commit often, the index owner wants to limit the commit frequency to avoid search degradation (or OOM errors from too many overlapping on-deck searchers).
          3) there are infrequent single updates to an index, but when they happen a commit should be done ASAP to lower the latency till visibility. Subject to limitations in (2)

          It's possible people might even want tuning on the fly.
          For instance, if you are about to do a complete index rebuild, it would be nice to be able to turn off any autocommitting, or commits altogether, to avoid partial indexes being visible. This could be done with special update commands, or perhaps flags on the <add> command?

          Mike Klaas added a comment -

          > The multithreaded stuff looks very nice!

          Thanks – the new concurrent package is quite cool (it fits my mental model of thread synchronization much better).

          About the closeWriter() -> absolutely correct. I had totally forgotten about the importance of opening a new searcher as part of the commit.

          1) sounds good.
          2) does this include explicit commits received? Might this make it tricky to test (both on solr's side and in userland)? Perhaps the default commit could be an absolute directive, but could accept an optional parameter that allows the commit to be cancelled by the intervalLowerBound parameter. Something like <commit hint="true"/>. Or it could be a new command <softcommit/>. autocommits would also fall into that category.
          3) would this case be covered by the client sending an <add> followed by <softcommit/>?

          The on-the-fly tuning is possible for maxDocs autocommit but isn't practical for maxTime, since the time-based auto-commit is a system-wide property (whereas maxDocs are only checked within the context of a given request, so it is trivial to turn off checking locally). It might be better to have some sort of global parameter tuning possible here (jmx??).

          -Mike

          Hoss Man added a comment -

          Thinking about this purely from a configuration standpoint, without any regard to implementation complexity, the most general-purpose options I can think of would be:

          commitFrequencyHardMin (milliseconds)
          commitFrequencySoftMin (milliseconds, greater than commitFrequencyHardMin)
          commitFrequencyTestDelay (milliseconds)
          commitFrequencySoftMax (milliseconds, greater than commitFrequencySoftMin)
          commitFrequencyHardMax (milliseconds, greater than commitFrequencySoftMax)
          autoCommitDocs (integer)

          ...assuming timeSinceLastCommit, timeSinceLastAddOrDelete, and numDocsAddedSinceLastCommit are available to us, these are the situations in which a commit would happen...

          a) commitFrequencyHardMax <= timeSinceLastCommit
          b) commitFrequencySoftMax <= timeSinceLastCommit < commitFrequencyHardMax
          && commitFrequencyTestDelay <= timeSinceLastAddOrDelete
          c) commitFrequencySoftMin <= timeSinceLastCommit < commitFrequencySoftMax
          && [commit command received]
          d) commitFrequencyHardMin <= timeSinceLastCommit < commitFrequencySoftMin
          && commitFrequencyTestDelay <= timeSinceLastAddOrDelete
          && [commit command received]

          The situation where autoCommitDocs <= numDocsAddedSinceLastCommit would generate a commit command (just as if a user had issued it).

          I think that various usages of those params (assuming they could be modified on the fly using JMX or some new command) would cover all of the use cases Yonik described, as well as all the ones I can think of.

          (implementing that may be overkill, however)
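          The rules (a)-(d) above could be evaluated roughly as follows (a hypothetical sketch only; the parameter names follow this comment, and none of this was ever implemented in Solr):

```java
// Hypothetical decision logic for the proposed commit-frequency parameters;
// nothing here is actual Solr code.
class CommitPolicySketch {
    long hardMin, softMin, testDelay, softMax, hardMax; // all in milliseconds
    int autoCommitDocs;

    boolean shouldCommit(long timeSinceLastCommit,
                         long timeSinceLastAddOrDelete,
                         int numDocsAddedSinceLastCommit,
                         boolean commitCommandReceived) {
        // autoCommitDocs reached: behaves as if a user issued a commit
        if (autoCommitDocs <= numDocsAddedSinceLastCommit) {
            commitCommandReceived = true;
        }
        // (a) hard maximum exceeded: commit unconditionally
        if (hardMax <= timeSinceLastCommit) return true;
        // (b) past the soft maximum and updates have quiesced
        if (softMax <= timeSinceLastCommit
                && testDelay <= timeSinceLastAddOrDelete) return true;
        // (c) in [softMin, softMax): honor an explicit commit immediately
        if (softMin <= timeSinceLastCommit && timeSinceLastCommit < softMax
                && commitCommandReceived) return true;
        // (d) in [hardMin, softMin): explicit commit honored only after quiescence
        if (hardMin <= timeSinceLastCommit && timeSinceLastCommit < softMin
                && testDelay <= timeSinceLastAddOrDelete
                && commitCommandReceived) return true;
        return false;
    }
}
```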

          Mike Klaas added a comment -

          It certainly appears that there is far more to this functionality than I have put thought into (mostly as the usage patterns are rather different from my usage of solr), and I'm inclined to think that this functionality is better implemented by someone with more time/motivation (especially considering we have a volunteer).

          The important part of this patch for me is the multi-threaded indexing, and a max-doc autocommit would also be nice. I also think that these two items are the least controversial. How about I reduce the scope of the patch to those two items, leaving potential for more autocommit constraints?

          Yonik Seeley added a comment -

          > How about I reduce the scope of the patch to those two items, leaving potential for more autocommit constraints?
          Sounds good.

          Mike Klaas added a comment -

          New patch.

          First, the locking semantics actually were wrong. Since every addDoc call grabbed the commit lock and downgraded to the access lock, subsequent calls would block on the commit. I tried a few vastly different schemes, and it took a while to figure out something that allowed concurrency but also gave the same protections as before. I finally settled on using the read/write commit lock as the principal lock, with a touch of synchronization to protect the addDoc calls.

          That finally enabled concurrency, but other bottlenecks emerged. checkCommit() was grabbing the commit lock, which created a barrier at the end of every addDoc call, forcing it to wait for all pending addDoc calls. I switched to synchronizing on the tracker (synchronizing on DUH2 would provoke a potential deadlock).

          Finally, there was significant contention on the lock for the logger output stream. When merging wasn't occurring, the doc rate could reach 200-300 docs/sec, and each docId was being logged. I modified the bulk add code to log the docIds of all documents in a single log statement. While I was at it, I converted the <result> output for multi-adds to a single xml element. Was more information going to be added to this?
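          The logging change can be illustrated with a hypothetical helper (not the patch's actual code): the bulk-add path builds one line for the whole batch instead of logging each docId separately, so the logger's stream lock is taken once per request rather than once per document.

```java
import java.util.List;

// Hypothetical helper illustrating batched logging: one log line per
// bulk add instead of one line per document.
class BulkAddLogging {
    static String formatBulkAdd(List<String> docIds) {
        return "added " + docIds.size() + " docs: id=" + String.join(",", docIds);
    }
}
```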

          The gains of multi-threaded indexing for my application are modest. The cpu usage is >100% consistently; it drops a bit during medium merges and drops a lot during large merges (merges effectively serialize adding documents). Still, the throughput gain is about 20-30%. In retrospect, this isn't terribly surprising, as our analysis is relatively modest. Applications with heavier analysis needs would see more gains.

          Mike Klaas added a comment -

          This is the issue that the patch belongs to. Sorry.

          Mike Klaas added a comment -

          The last patch has a screwup which causes a few tests to fail... this will be fixed in the next version.

          Mike Klaas added a comment -

          This version removes the attempt at parsing the rest of the xml if an error occurs during document update.

          It has mostly already been reviewed, but to summarize, this patch:

          • improves concurrency during multi-threaded document update
          • potentially optimizes huge commits (may prevent a stall if memory is constrained)
           • eliminates the edge cases of document update (esp. multi-doc update) which caused
             solr to return xml that couldn't be parsed as a single document (should also resolve SOLR-2 and SOLR-54)

           If no one has objections, I'll commit the patch tomorrow.

          Yonik Seeley added a comment -

          Everything looks good to me.

          > While I was at it, I converted the <result> output for multi-adds to a single xml element. Was more information going to be added to this?

          I don't think so.

          > Still, the throughput gain is about 20-30%.

          Not too bad... is that for a dual-CPU machine? The number of cores per chip and cores per server are going up (quad cores, Sun Niagara, etc.), so these types of optimizations will grow in importance.

          I wonder what kind of gains could be had if Lucene could overlap adding of documents (via buffering) with merging of segments.

          Mike Klaas added a comment -

          [[ Old comment, sent from unregistered email on Wed, 8 Nov 2006 13:43:59 -0800 ]]

          Yep, that's a 2-cpu machine.

          A good question--it is definitely possible to test such a scenario by
          manually creating ram segments and adding them (now that
          addIndicesNoOptimize exists).

          It still might be possible to improve parallelism on the solr end by
          allowing doc adds and doDeletions to run concurrently, but if that were a
          major bottleneck then the commit frequency could be reduced anyway.

          Thanks for the review,
          -Mike

          Mike Klaas added a comment -

          Checked in patch r472720

          Hoss Man added a comment -

          This bug was modified as part of a bulk update using the criteria...

          • Marked ("Resolved" or "Closed") and "Fixed"
          • Had no "Fix Version" versions
          • Was listed in the CHANGES.txt for 1.1

          The Fix Version for all 38 issues found was set to 1.1, email notification
          was suppressed to prevent excessive email.

          For a list of all the issues modified, search jira comments for this
          (hopefully) unique string: 20080415hossman3


            People

            • Assignee:
              Mike Klaas
              Reporter:
              Mike Klaas
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development