Solr
  1. Solr
  2. SOLR-3221

Make Shard handler threadpool configurable

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 3.6, 4.0-ALPHA
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None

      Description

      From profiling of monitor contention, as well as observations of the
      95th and 99th response times for nodes that perform distributed search
      (or ‟aggregator‟ nodes) it would appear that the HttpShardHandler code
      currently does a suboptimal job of managing outgoing shard level
      requests.

      Presently the code contained within lucene 3.5's SearchHandler and
      Lucene trunk / 3x's ShardHandlerFactory create arbitrary threads in
      order to service distributed search requests. This is done presently to
      limit the size of the threadpool such that it does not consume resources
      in deployment configurations that do not use distributed search.

      This unfortunately has two impacts on the response time if the node
      coordinating the distribution is under high load.

      The usage of the MaxConnectionsPerHost configuration option results in
      aggressive activity on semaphores within HttpCommons, it has been
      observed that the aggregator can have a response time far greater than
      that of the searchers. The above monitor contention would appear to
      suggest that in some cases its possible for liveness issues to occur and
      for simple queries to be starved of resources simply due to a lack of
      attention from the viewpoint of context switching.

      With, as mentioned above the http commons connection being hotly
      contended

      The fair, queue based configuration eliminates this, at the cost of
      throughput.

      This patch aims to make the threadpool largely configurable allowing for
      those using solr to choose the throughput vs latency balance they
      desire.

      1. SOLR-3221-3x_branch.patch
        6 kB
        Greg Bowyer
      2. SOLR-3221-trunk.patch
        7 kB
        Greg Bowyer
      3. SOLR-3221-trunk.patch
        9 kB
        Greg Bowyer
      4. SOLR-3221-3x_branch.patch
        8 kB
        Greg Bowyer
      5. SOLR-3221-trunk.patch
        7 kB
        Greg Bowyer
      6. SOLR-3221-3x_branch.patch
        6 kB
        Greg Bowyer
      7. SOLR-3221-trunk.patch
        7 kB
        Greg Bowyer
      8. SOLR-3221-3x_branch.patch
        6 kB
        Greg Bowyer
      9. SOLR-3221-trunk.patch
        7 kB
        Greg Bowyer
      10. SOLR-3221-3x_branch.patch
        6 kB
        Greg Bowyer

        Activity

        Hide
        Jason Rutherglen added a comment -

        +1 Long overdue.

        Show
        Jason Rutherglen added a comment - +1 Long overdue.
        Hide
        Greg Bowyer added a comment -

        Lucene 4.x version

        Show
        Greg Bowyer added a comment - Lucene 4.x version
        Hide
        Greg Bowyer added a comment -

        Whoops ! I didnt spot a silly potential NPE in this patch, found locally by the randomised testing and corrected

        Show
        Greg Bowyer added a comment - Whoops ! I didnt spot a silly potential NPE in this patch, found locally by the randomised testing and corrected
        Hide
        Greg Bowyer added a comment -

        .... And remove the git nonsense from the patch files

        Show
        Greg Bowyer added a comment - .... And remove the git nonsense from the patch files
        Hide
        Greg Bowyer added a comment -

        ... apparently I cannot write code :S

        Show
        Greg Bowyer added a comment - ... apparently I cannot write code :S
        Hide
        Eric Pugh added a comment -

        I've been struggling with the same issue, and resolved it by just bumping up the number of connections per host (see SOLR-2184 which maybe should be marked as a related issue?)

        One thing that I would suggest (and be willing to help with!) is doing a great job of documenting what these parameters do, and potentially listing out as examples the two options for optimizing for throughput versus fair balanced response times, and why you would do one or the other. Exposing these params is good, but can often cause folks to seize on them and use the inappropriately and shoot themselves in the foot. As an example, think of the current issues with the "Optimize" command, where I have seen folks optimize after every commit, because who wouldn't want that!?

        Show
        Eric Pugh added a comment - I've been struggling with the same issue, and resolved it by just bumping up the number of connections per host (see SOLR-2184 which maybe should be marked as a related issue?) One thing that I would suggest (and be willing to help with!) is doing a great job of documenting what these parameters do, and potentially listing out as examples the two options for optimizing for throughput versus fair balanced response times, and why you would do one or the other. Exposing these params is good, but can often cause folks to seize on them and use the inappropriately and shoot themselves in the foot. As an example, think of the current issues with the "Optimize" command, where I have seen folks optimize after every commit, because who wouldn't want that!?
        Hide
        Erick Erickson added a comment -

        The 3x patch runs through all the tests, but the trunk fails.

        The first one at least (and I'd guess the others as well) apparently don't depend on this seed, they fail in IntelliJ too. They all seem to fail on SearchHandler[298]. I think once it was a result of something not being implemented in one of the Mocks, but can't swear to it.

        Errors in the debugger like: java.lang.IllegalArgumentException: Invalid uri 'http://://[::1]:33332/solr/select': escaped absolute path not valid. What????

        I don't have time to look into it now...

        ant test -Dtestcase=TestDistributedGrouping -Dtestmethod=testDistribSearch -Dtests.seed=-7e47ec572525db2b:747a1a2f9e04dca3:611098fa6c7bdbe4 -Dargs="-Dfile.encoding=MacRoman"
        ant test -Dtestcase=FullSolrCloudDistribCmdsTest -Dtestmethod=testDistribSearch -Dtests.seed=666cf0bc1fb2bbe2:-307010e33b27cd24:611098fa6c7bdbe4 -Dargs="-Dfile.encoding=MacRoman"
        ant test -Dtestcase=DistributedSpellCheckComponentTest -Dtestmethod=testDistribSearch -Dtests.seed=-19364590f7798def:3cfbf5694891d7b:611098fa6c7bdbe4 -Dargs="-Dfile.encoding=MacRoman"
        ant test -Dtestcase=TestDistributedSearch -Dtestmethod=testDistribSearch -Dtests.seed=6cdbfdcabaf8471c:af7365c8c8b063:-35863e1b26976412 -Dargs="-Dfile.encoding=MacRoman"
        ant test -Dtestcase=OverseerTest -Dtestmethod=testShardLeaderChange -Dtests.seed=47a8a04f4df29e39:7cf1cab9cde30f9f:-35863e1b26976412 -Dargs="-Dfile.encoding=MacRoman"
        ant test -Dtestcase=BasicDistributedZkTest -Dtestmethod=testDistribSearch -Dtests.seed=-37854d406161a603:-3d7c63121a0b99fe:-a77be13ddf887d7 -Dargs="-Dfile.encoding=MacRoman"
        ant test -Dtestcase=DistributedTermsComponentTest -Dtestmethod=testDistribSearch -Dtests.seed=6fe8b3c1860d0a59:-20d9908f662d6a90:-35863e1b26976412 -Dargs="-Dfile.encoding=MacRoman"

        Show
        Erick Erickson added a comment - The 3x patch runs through all the tests, but the trunk fails. The first one at least (and I'd guess the others as well) apparently don't depend on this seed, they fail in IntelliJ too. They all seem to fail on SearchHandler [298] . I think once it was a result of something not being implemented in one of the Mocks, but can't swear to it. Errors in the debugger like: java.lang.IllegalArgumentException: Invalid uri 'http://:// [::1] :33332/solr/select': escaped absolute path not valid. What???? I don't have time to look into it now... ant test -Dtestcase=TestDistributedGrouping -Dtestmethod=testDistribSearch -Dtests.seed=-7e47ec572525db2b:747a1a2f9e04dca3:611098fa6c7bdbe4 -Dargs="-Dfile.encoding=MacRoman" ant test -Dtestcase=FullSolrCloudDistribCmdsTest -Dtestmethod=testDistribSearch -Dtests.seed=666cf0bc1fb2bbe2:-307010e33b27cd24:611098fa6c7bdbe4 -Dargs="-Dfile.encoding=MacRoman" ant test -Dtestcase=DistributedSpellCheckComponentTest -Dtestmethod=testDistribSearch -Dtests.seed=-19364590f7798def:3cfbf5694891d7b:611098fa6c7bdbe4 -Dargs="-Dfile.encoding=MacRoman" ant test -Dtestcase=TestDistributedSearch -Dtestmethod=testDistribSearch -Dtests.seed=6cdbfdcabaf8471c:af7365c8c8b063:-35863e1b26976412 -Dargs="-Dfile.encoding=MacRoman" ant test -Dtestcase=OverseerTest -Dtestmethod=testShardLeaderChange -Dtests.seed=47a8a04f4df29e39:7cf1cab9cde30f9f:-35863e1b26976412 -Dargs="-Dfile.encoding=MacRoman" ant test -Dtestcase=BasicDistributedZkTest -Dtestmethod=testDistribSearch -Dtests.seed=-37854d406161a603:-3d7c63121a0b99fe:-a77be13ddf887d7 -Dargs="-Dfile.encoding=MacRoman" ant test -Dtestcase=DistributedTermsComponentTest -Dtestmethod=testDistribSearch -Dtests.seed=6fe8b3c1860d0a59:-20d9908f662d6a90:-35863e1b26976412 -Dargs="-Dfile.encoding=MacRoman"
        Hide
        Robert Muir added a comment -
        +    this.scheme = getParameter(args, INIT_URL_SCHEME, "http://");
        +    this.scheme = (!this.scheme.endsWith("://")) ? this.scheme : this.scheme + "://";
        

        Looks like this log: adds the :// again if it already exists?

        Show
        Robert Muir added a comment - + this.scheme = getParameter(args, INIT_URL_SCHEME, "http://"); + this.scheme = (!this.scheme.endsWith("://")) ? this.scheme : this.scheme + "://"; Looks like this log: adds the :// again if it already exists?
        Hide
        Greg Bowyer added a comment - - edited

        yes thats the problem, I will recut both patches in a moment

        Show
        Greg Bowyer added a comment - - edited yes thats the problem, I will recut both patches in a moment
        Hide
        Greg Bowyer added a comment -

        Fixed the schema issue

        Show
        Greg Bowyer added a comment - Fixed the schema issue
        Hide
        Erick Erickson added a comment -

        OK, this passes all the tests on my mac. I'll have a chance to look over the code today. I'll commit this today or tomorrow unless somebody finds something.

        Show
        Erick Erickson added a comment - OK, this passes all the tests on my mac. I'll have a chance to look over the code today. I'll commit this today or tomorrow unless somebody finds something.
        Hide
        Erick Erickson added a comment -

        OK, this passes all the tests on my mac. I'll have a chance to look over the code today. I'll commit this today or tomorrow unless somebody finds something.

        Show
        Erick Erickson added a comment - OK, this passes all the tests on my mac. I'll have a chance to look over the code today. I'll commit this today or tomorrow unless somebody finds something.
        Hide
        Yonik Seeley added a comment -

        The above monitor contention would appear to suggest that in some cases its possible for liveness issues to occur and for simple queries to be starved of resources simply due to a lack of attention from the viewpoint of context switching.

        that's pretty interesting... how many concurrent requests were you running through a single aggregator?

        Show
        Yonik Seeley added a comment - The above monitor contention would appear to suggest that in some cases its possible for liveness issues to occur and for simple queries to be starved of resources simply due to a lack of attention from the viewpoint of context switching. that's pretty interesting... how many concurrent requests were you running through a single aggregator?
        Hide
        Greg Bowyer added a comment -

        About 100 Requests/second

        Show
        Greg Bowyer added a comment - About 100 Requests/second
        Hide
        Erick Erickson added a comment -

        Trunk r: 1300160
        3.6 r: 1300158

        Big thanks Greg!

        Greg:
        How about an entry in CHANGES.txt? And perhaps a short example somewhere on the Wiki explaining how to make this happen?

        Show
        Erick Erickson added a comment - Trunk r: 1300160 3.6 r: 1300158 Big thanks Greg! Greg: How about an entry in CHANGES.txt? And perhaps a short example somewhere on the Wiki explaining how to make this happen?
        Hide
        Greg Bowyer added a comment -

        No problems, how would I change the wiki, just go ahead and edit it ?

        I am loathe to submit a patch for changing the CHANGES.txt, but is that the same as the wiki ?

        Show
        Greg Bowyer added a comment - No problems, how would I change the wiki, just go ahead and edit it ? I am loathe to submit a patch for changing the CHANGES.txt, but is that the same as the wiki ?
        Hide
        Erick Erickson added a comment -

        Yeah, just go ahead and edit the Wiki, all you need to do is create an account.

        As for CHANGES.txt. Just attach it to this JIRA and I'll go ahead and commit it without a new JIRA. Two tricky things:
        1> there are two files, one in the 3x branch and one in the 4x, both need the text
        2> The 4x branch needs to be edited in two places, one for the 3.x section and in the 4x section.

        I'll see it when if this JIRA changes and check them in.

        Thanks!

        Show
        Erick Erickson added a comment - Yeah, just go ahead and edit the Wiki, all you need to do is create an account. As for CHANGES.txt. Just attach it to this JIRA and I'll go ahead and commit it without a new JIRA. Two tricky things: 1> there are two files, one in the 3x branch and one in the 4x, both need the text 2> The 4x branch needs to be edited in two places, one for the 3.x section and in the 4x section. I'll see it when if this JIRA changes and check them in. Thanks!
        Hide
        Greg Bowyer added a comment -

        I added some information to the wiki in the solrconfig.xml section talking about the shardHandlerFactory that you might want to review. Documentation is not my strong point

        also can we can put the following into the CHANGES.TXT

        • SOLR-3221: Added the ability to directly configure aspects of the concurrency
          and thread-pooling used within distributed search in solr. This allows for finer
          grained controlled and can be tuned by end users to target their own specific
          requirements. This builds on the work of the HttpCommComponent and uses the same
          configuration block to configure the thread pool. The default configuration has
          the same behaviour as solr 3.5, favouring throughput over latency. More
          information can be found on the wiki (http://wiki.apache.org/solr/SolrConfigXml)
        Show
        Greg Bowyer added a comment - I added some information to the wiki in the solrconfig.xml section talking about the shardHandlerFactory that you might want to review. Documentation is not my strong point also can we can put the following into the CHANGES.TXT SOLR-3221 : Added the ability to directly configure aspects of the concurrency and thread-pooling used within distributed search in solr. This allows for finer grained controlled and can be tuned by end users to target their own specific requirements. This builds on the work of the HttpCommComponent and uses the same configuration block to configure the thread pool. The default configuration has the same behaviour as solr 3.5, favouring throughput over latency. More information can be found on the wiki ( http://wiki.apache.org/solr/SolrConfigXml )
        Hide
        David Smiley added a comment -

        Curious; why was the default choice made to favor throughput over low latency? Generally speaking, low latency in search is more important. There should be no issue of backwards compatibility, either.

        Show
        David Smiley added a comment - Curious; why was the default choice made to favor throughput over low latency? Generally speaking, low latency in search is more important. There should be no issue of backwards compatibility, either.
        Hide
        Greg Bowyer added a comment -

        Only to support the exacting behaviour that was in solr prior to this patch.

        Show
        Greg Bowyer added a comment - Only to support the exacting behaviour that was in solr prior to this patch.
        Hide
        Mark Miller added a comment -

        I am loathe to submit a patch for changing the CHANGES.txt

        Usually the committer handles this - no rules about it though.

        I'd go with something a bit shorter - no need to get into the gritty details - that's why the JIRA issue number is there. I'd stick to something closer to "Make shard handler threadpool configurable." or "Added the ability to directly configure aspects of the concurrency and thread-pooling used within distributed search in solr."

        Show
        Mark Miller added a comment - I am loathe to submit a patch for changing the CHANGES.txt Usually the committer handles this - no rules about it though. I'd go with something a bit shorter - no need to get into the gritty details - that's why the JIRA issue number is there. I'd stick to something closer to "Make shard handler threadpool configurable." or "Added the ability to directly configure aspects of the concurrency and thread-pooling used within distributed search in solr."
        Hide
        Greg Bowyer added a comment -

        Sorry the changes.txt change was done so I dont think there is anything left for this jira ticket, I think Erick added the changes.txt for the 3.6 release

        Show
        Greg Bowyer added a comment - Sorry the changes.txt change was done so I dont think there is anything left for this jira ticket, I think Erick added the changes.txt for the 3.6 release
        Hide
        David Smiley added a comment -

        Only to support the exacting behaviour that was in solr prior to this patch.

        I wouldn't of made the same decision but it's moot now, esp. for 3x. I don't see how this change would break anybody, and I see how it would improve.

        I think Solr 4 should have better defaults; agreed?

        Show
        David Smiley added a comment - Only to support the exacting behaviour that was in solr prior to this patch. I wouldn't of made the same decision but it's moot now, esp. for 3x. I don't see how this change would break anybody, and I see how it would improve. I think Solr 4 should have better defaults; agreed?
        Hide
        Markus Jelsma added a comment -

        I would agree that latency is preferred as default.

        Show
        Markus Jelsma added a comment - I would agree that latency is preferred as default.
        Hide
        Greg Bowyer added a comment -

        I agree, I was being cowardly when I wrote it because I am not a committer

        Show
        Greg Bowyer added a comment - I agree, I was being cowardly when I wrote it because I am not a committer
        Hide
        David Smiley added a comment -

        Greg, I heard you intend to add a small patch to flip Solr 4's default on this feature? I was picking Erick since forever and he pawned it off on you.

        Show
        David Smiley added a comment - Greg, I heard you intend to add a small patch to flip Solr 4's default on this feature? I was picking Erick since forever and he pawned it off on you.

          People

          • Assignee:
            Erick Erickson
            Reporter:
            Greg Bowyer
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development