SOLR-6136: ConcurrentUpdateSolrServer includes a Spin Lock

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 4.6, 4.6.1, 4.7, 4.7.1, 4.7.2, 4.8, 4.8.1
    • Fix Version/s: 4.10, 6.0
    • Component/s: SolrCloud
    • Labels: None

      Description

      ConcurrentUpdateSolrServer.blockUntilFinished() includes a spin lock. This causes extremely high CPU usage on the cloud leader during indexing.

      Here is a summary of our system testing.

      Importing data on Solr 4.5.0:
      Throughput gets as high as 240 documents per second.

      [tomcat@solr-stg01 logs]$ uptime
      09:53:50 up 310 days, 23:52, 1 user, load average: 3.33, 3.72, 5.43

      PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
      9547 tomcat 21 0 6850m 1.2g 16m S 86.2 5.0 1:48.81 java

      Importing data on Solr 4.7.0 with no replicas:
      Throughput peaks at 350 documents per second.

      [tomcat@solr-stg01 logs]$ uptime
      10:03:44 up 311 days, 2 min, 1 user, load average: 4.57, 2.55, 4.18

      PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
      9728 tomcat 23 0 6859m 2.2g 28m S 62.3 9.0 2:20.20 java

      Importing data on Solr 4.7.0 with replicas:
      Throughput peaks at 30 documents per second because the Solr machine is out of CPU.

      [tomcat@solr-stg01 logs]$ uptime
      09:40:04 up 310 days, 23:38, 1 user, load average: 30.54, 12.39, 4.79

      PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
      9190 tomcat 17 0 7005m 397m 15m S 198.5 1.6 7:14.87 java
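
      The CPU symptom above is characteristic of a busy-wait. A minimal illustrative sketch (hypothetical names and structure, not the actual CUSS code):

      ```java
      public class SpinWaitSketch {
          private final java.util.Queue<Object> runners =
              new java.util.concurrent.ConcurrentLinkedQueue<>();

          // Hypothetical simplification of a spin-style blockUntilFinished():
          // the caller re-checks shared state in a tight loop, so one core
          // stays pinned even though no useful work is being done.
          public int blockUntilFinishedSpin() {
              int spins = 0;
              while (!runners.isEmpty()) {
                  spins++; // no wait/sleep: the thread burns CPU until the queue drains
              }
              return spins;
          }

          public static void main(String[] args) {
              SpinWaitSketch s = new SpinWaitSketch();
              // With an empty queue the loop exits immediately; with live
              // runners it would re-check state continuously, consistent with
              // the high load averages reported for the replica case.
              System.out.println("spins: " + s.blockUntilFinishedSpin());
          }
      }
      ```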

      Attachments

      1. SOLR-6136.patch (15 kB) by Timothy Potter
      2. wait___notify_all.patch (9 kB) by Brandon Chapman

        Issue Links

          Activity

          Brandon Chapman added a comment -

          Applying the patch from the linked ticket to Solr 4.5 will cause the same issue to be present in Solr 4.5.

          Brandon Chapman added a comment -

          Attached patch for Solr 4.7.1 drastically improves performance. It works around the spin lock using a simple wait / notify mechanism; it is not a suggestion for how to fix ConcurrentUpdateSolrServer in an official release.
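
          A minimal sketch of the wait / notify pattern the workaround describes (hypothetical names and structure, not the attached patch itself):

          ```java
          public class WaitNotifySketch {
              private final java.util.Queue<Object> runners = new java.util.ArrayDeque<>();

              public synchronized void addRunner(Object r) {
                  runners.add(r);
              }

              public synchronized void runnerFinished(Object r) {
                  runners.remove(r);
                  notifyAll(); // wake any thread parked in blockUntilFinished()
              }

              public synchronized void blockUntilFinished() {
                  while (!runners.isEmpty()) {
                      try {
                          wait(); // releases the monitor; no CPU burned while parked
                      } catch (InterruptedException e) {
                          Thread.currentThread().interrupt();
                          return;
                      }
                  }
              }

              public static void main(String[] args) {
                  WaitNotifySketch s = new WaitNotifySketch();
                  Object runner = new Object();
                  s.addRunner(runner);
                  Thread worker = new Thread(() -> {
                      try { Thread.sleep(50); } catch (InterruptedException ignored) {}
                      s.runnerFinished(runner);
                  });
                  worker.start();
                  s.blockUntilFinished(); // parks until the worker calls notifyAll()
                  System.out.println("all runners finished");
              }
          }
          ```

          The wait() call sits inside a loop that re-checks the condition, which is the standard guard against spurious wakeups.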

          Timothy Potter added a comment -

          Thanks for the patch Brandon! I'll start working on this issue tomorrow unless someone can dig into it sooner.

          Mark Miller added a comment -

          I'll def help with any thoughts or review on this one - important issue to fix.

          Timothy Potter added a comment -

          Thanks Mark. My strategy is to use Brandon's wait/notify around the runners object, which seems clean to me. The only other thing his patch needs is to use your fix for SOLR-4260 (the check for non-empty queue and no runners). Unless you guys see a problem with using wait/notify on the runners object?

          Main thing is the testing ... I've yet to actually see this block of code show up as a CPU hotspot in my testing. I'm trying some things on a CPU-constrained node with complex docs to see if I can get it to trigger.

          Shalin Shekhar Mangar added a comment -

          Main thing is the testing ... I've yet to actually see this block of code show up as a CPU hotspot in my testing. I'm trying some things on a CPU-constrained node with complex docs to see if I can get it to trigger.

          It might be easier to trigger if you send docs individually at a high rate instead of batching them.

          Timothy Potter added a comment -

          Thanks Shalin ... that helped a little. Here's where I'm at with this: using the current "spin-lock" code on trunk (without wait/notify), VisualVM shows some self CPU time in blockUntilFinished; with Brandon's patch plus the fix for SOLR-4260 added back in, the self CPU time is negligible, which aligns with what Brandon reported. So I think I'm convinced this is a good improvement and am now going to run some stress tests and then post a patch for review. Also, surprisingly, CUSS didn't have a unit test, so I'm building that too.

          Timothy Potter added a comment -

          Mark - just curious - what was the thinking on using only 1 thread for the CUSS used by StreamingSolrServers? Specifically, I'm looking at this code in StreamingSolrServers:

          server = new ConcurrentUpdateSolrServer(url, httpClient, 100, 1, updateExecutor, true) {

          If I had to guess it was that you didn't want to over-complicate debugging when this code was new? Wondering if you see value in increasing the thread count a little?

          Mark Miller added a comment -

          I intended on making it configurable eventually, but I think it should default to 1. We should be careful about spinning up too many threads by default for a single request, I think. Once we fix the distributed deadlock, I'd like it if you could largely control the number of threads proportionally just by using the container thread-pool sizing config.

          Timothy Potter added a comment -

          Ok, makes sense ... I'm wondering if it doesn't make sense to be a little more flexible if the queue is filling up? Currently, the code only allocates a new Runner if:

          if (runners.isEmpty() || (queue.remainingCapacity() < queue.size() && runners.size() < threadCount)) {

          My thinking is to relax this a little to allow an additional runner if the queue is half full but otherwise just keep it at 1.
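
          The two policies under discussion can be sketched as pure predicates (a hypothetical helper for illustration, not Solr code; the relaxed rule reflects one reading of the proposal above):

          ```java
          public class RunnerPolicy {
              // Mirrors the condition quoted above: start a runner when none
              // exist, or when the queue is more than half full and we are
              // still under threadCount.
              static boolean currentPolicy(int runnersSize, int queueSize,
                                           int remainingCapacity, int threadCount) {
                  return runnersSize == 0
                      || (remainingCapacity < queueSize && runnersSize < threadCount);
              }

              // Hypothetical relaxed rule: keep a single runner, but allow one
              // extra once the queue reaches half of its capacity.
              static boolean relaxedPolicy(int runnersSize, int queueSize, int capacity) {
                  return runnersSize == 0
                      || (runnersSize == 1 && queueSize * 2 >= capacity);
              }

              public static void main(String[] args) {
                  // Empty system: always start the first runner.
                  System.out.println(currentPolicy(0, 0, 10, 1));
                  // Queue more than half full and under threadCount: add another.
                  System.out.println(currentPolicy(1, 6, 4, 2));
                  // Queue under half capacity: the relaxed rule stays at one runner.
                  System.out.println(relaxedPolicy(1, 4, 10));
              }
          }
          ```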

          Mark Miller added a comment -

          My thinking is to relax this a little to allow an additional runner if the queue is half full but otherwise just keep it at 1.

          That sounds okay to me - my only worry is doing something that spins up too many threads for a small queue.

          Timothy Potter added a comment -

          Here's a patch based largely on Brandon's original patch, using wait / notifyAll instead of the spin lock in blockUntilFinished. As mentioned above, VisualVM shows good evidence of this improvement in that the amount of CPU spent in the block method is negligible with this patch (and very noticeable without it).

          I've also included the first cut at a unit test for CUSS. There's probably more things we can do to exercise the logic in CUSS, so let me know if you have any other ideas for the unit test.

          Brandon - please try this patch out in your environment if possible. I'll plan to commit this to trunk and backport to the 4x branch in a few days after keeping an eye on things in Jenkins.

          Timothy Potter added a comment -

          btw - I decided to not mess with the threadCount stuff Mark and I were discussing here as that should be handled under another "improvement" ticket after doing some benchmarking to show if it even helps.

          Mark Miller added a comment -
          -            final UpdateRequest updateRequest = queue.poll(250,
          -                TimeUnit.MILLISECONDS);
          +            final UpdateRequest updateRequest = 
          +                queue.poll(pollQueueTime, TimeUnit.MILLISECONDS);
                       if (updateRequest == null)
          

          Know when that bug was introduced? If it went out in 4.9, that is a pretty severe performance bug if you are not streaming or batching big.
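
          A minimal sketch of the pattern in the diff above, with the timeout held in a variable rather than hard-coded (pollQueueTime is taken from the diff; the rest is illustrative):

          ```java
          import java.util.concurrent.LinkedBlockingQueue;
          import java.util.concurrent.TimeUnit;

          public class PollSketch {
              public static void main(String[] args) {
                  LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();
                  long pollQueueTime = 50; // ms; configurable instead of a hard-coded 250
                  try {
                      // poll() returns null once the timeout elapses with nothing
                      // queued, letting the runner decide whether to keep going
                      // or shut down instead of blocking indefinitely.
                      String updateRequest = queue.poll(pollQueueTime, TimeUnit.MILLISECONDS);
                      System.out.println(updateRequest == null ? "queue empty" : updateRequest);
                  } catch (InterruptedException e) {
                      Thread.currentThread().interrupt();
                  }
              }
          }
          ```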

          Timothy Potter added a comment -

          not sure about that one ... just something I caught while working on this issue

          Mark Miller added a comment -

          Ah, I see, it's a different poll call. I only had this take effect on one of the poll calls because that was enough to relieve the performance issue in my benchmarks. I think it makes sense to expand it to this other poll call as well.

          +1 on the patch, looks okay to me, tests pass locally.

          I don't want to think about testing CUSS at the moment, but nice work on a new test for it. Will be great to have it grow - this has been a troublesome class to stabilize over the years.

          Mark Miller added a comment -

          I don't know if it's this or not, but right around when this went in I started seeing the rare hang at:

             [junit4]    >         at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer.blockUntilFinished(ConcurrentUpdateSolrServer.java:374)
             [junit4]    >         at org.apache.solr.update.StreamingSolrServers.blockUntilFinished(StreamingSolrServers.java:103)
             [junit4]    >         at org.apache.solr.update.SolrCmdDistributor.blockAndDoRetries(SolrCmdDistributor.java:228)
             [junit4]    >         at org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:89)
          Timothy Potter added a comment -

          ugh ... weird, though, as the notifyAll call that would free up the hang at that spot is in a finally block, so you'd think it would be getting called correctly ... the only path I can see where notify wouldn't get called is if the queue never empties.
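
          A stripped-down sketch of that scenario (hypothetical names, not the patch): the finally block guarantees the waiter wakes, but the waiter loops on its condition, so a queue that never empties parks it again indefinitely:

          ```java
          public class WaiterSketch {
              private final java.util.Queue<Object> queue = new java.util.ArrayDeque<>();

              public synchronized void add(Object item) {
                  queue.add(item);
              }

              public synchronized void blockUntilEmpty() {
                  while (!queue.isEmpty()) {
                      try {
                          wait(); // re-checks the condition on every wakeup; if the
                                  // queue never drains, this parks again (the hang)
                      } catch (InterruptedException e) {
                          Thread.currentThread().interrupt();
                          return;
                      }
                  }
              }

              public synchronized void drainOne() {
                  try {
                      queue.poll();
                  } finally {
                      notifyAll(); // runs even if poll() threw, so the normal path is safe
                  }
              }
          }
          ```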

          Mark Miller added a comment -

          I filed SOLR-6406 to track this.


            People

            • Assignee: Timothy Potter
            • Reporter: Brandon Chapman
            • Votes: 0
            • Watchers: 9
