Solr
  1. Solr
  2. SOLR-1711

Race condition in org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.java

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 1.4, 1.5
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: clients - java
    • Labels:
      None

      Description

      While inserting a large pile of documents using StreamingUpdateSolrServer there is a race condition as all Runner instances stop processing while the blocking queue is full. With a high performance client this could happen quite often, there is no way to recover from it at the client side.

      In StreamingUpdateSolrServer there is a BlockingQueue called queue to store UpdateRequests, there are up to threadCount number of workers threads from StreamingUpdateSolrServer.Runner to read that queue and push requests to a Solr instance. If at one point the BlockingQueue is empty all workers stop processing it and pushing the collected content to Solr which could be a time consuming process, sometimes all worker threads are waiting for Solr. If at this time the client fills the BlockingQueue to full all worker threads will quit without processing any further and the main thread will block forever.

      There is a simple, well tested patch attached to handle this situation.

      1. SOLR-1711.patch
        3 kB
        Yonik Seeley
      2. StreamingUpdateSolrServer.patch
        1 kB
        Attila Babo

        Issue Links

          Activity

          Hide
          Attila Babo added a comment -

          Patch 1, 2:
          Inside the Runner.run method I've added a do while loop to prevent the Runner to quit while there are new requests, this handles the problem of new requests added while Runner is sending the previous batch.

          Patch 3
          Validity check of method variable is not strictly necessary, just a code clean up.

          Patch 4
          The last part of the patch is to move synchronized outside of conditional to avoid a situation where runners change while evaluating it.

          To minify the patch all indentation has been removed.

          Show
          Attila Babo added a comment - Patch 1, 2: Inside the Runner.run method I've added a do while loop to prevent the Runner to quit while there are new requests, this handles the problem of new requests added while Runner is sending the previous batch. Patch 3 Validity check of method variable is not strictly necessary, just a code clean up. Patch 4 The last part of the patch is to move synchronized outside of conditional to avoid a situation where runners change while evaluating it. To minify the patch all indentation has been removed.
          Hide
          Yonik Seeley added a comment -

          Thanks Attila! I just committed this.

          Show
          Yonik Seeley added a comment - Thanks Attila! I just committed this.
          Hide
          Luke Forehand added a comment -

          This is a very serious problem for us. We have multiple threads adding to the StreamingUpdateSolrServer's BlockingQueue, and if I bump the thread count high enough (around 10 for my process) I can reproduce this problem every time. I'd say this bug is critical enough to warrant a SOLR bug-fix release.

          Show
          Luke Forehand added a comment - This is a very serious problem for us. We have multiple threads adding to the StreamingUpdateSolrServer's BlockingQueue, and if I bump the thread count high enough (around 10 for my process) I can reproduce this problem every time. I'd say this bug is critical enough to warrant a SOLR bug-fix release.
          Hide
          Hoss Man added a comment -

          Correcting Fix Version based on CHANGES.txt, see this thread for more details...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

          Show
          Hoss Man added a comment - Correcting Fix Version based on CHANGES.txt, see this thread for more details... http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E
          Hide
          Hoss Man added a comment -

          Committed revision 949473.

          merged to branch-1.4 for 1.4.1

          Show
          Hoss Man added a comment - Committed revision 949473. merged to branch-1.4 for 1.4.1
          Hide
          Johannes added a comment -

          We are still seeing the same issue with Solr1.4.1

          We get into this situation when all the runner threads die due to a broken pipe, while the BlockingQueue is still full. All of the producer threads are all blocked on the BlockingQueue.put() method. Since the runners are spawned by the producers, which are all blocked, runner threads never get created to drain the queue.

          Here's a potential fix. In the runner code, replace these lines:

          // remove it from the list of running things...
          synchronized (runners)

          { runners.remove( this ); }

          with these lines:

          // remove it from the list of running things unless we are the last runner and the queue is full...
          synchronized (runners) {
          if (runners.size() == 1 && queue.remainingCapacity() == 0)

          { // keep this runner alive scheduler.execute(this); }

          else

          { runners.remove( this ); }

          }

          Show
          Johannes added a comment - We are still seeing the same issue with Solr1.4.1 We get into this situation when all the runner threads die due to a broken pipe, while the BlockingQueue is still full. All of the producer threads are all blocked on the BlockingQueue.put() method. Since the runners are spawned by the producers, which are all blocked, runner threads never get created to drain the queue. Here's a potential fix. In the runner code, replace these lines: // remove it from the list of running things... synchronized (runners) { runners.remove( this ); } with these lines: // remove it from the list of running things unless we are the last runner and the queue is full... synchronized (runners) { if (runners.size() == 1 && queue.remainingCapacity() == 0) { // keep this runner alive scheduler.execute(this); } else { runners.remove( this ); } }
          Hide
          Yonik Seeley added a comment -

          Thanks Johannes, the fix does look correct, and I've committed to 3x and trunk.
          If we have another release of 1.4, we should backport this.

          Show
          Yonik Seeley added a comment - Thanks Johannes, the fix does look correct, and I've committed to 3x and trunk. If we have another release of 1.4, we should backport this.
          Hide
          Aakarsh Nair added a comment -

          We are still seeing this issue even after using Johannes fix. All runners are exiting and the main producer thread hangs on line 196 queue.put . I am thinking it may be because queue is getting drained and filled fast (queue size 50 , number of threads 20) . So there might be a race condition on the queue capacity check.Queue appears to be below capacity to the last runner
          then fills up by simultaneous calls to put . I still see the issue after backporting what is in 3.x branch for testing it with solr 1.4.1. I guess a solution may be to use larger queue capacities for now but the race conditions still seem to be present.

          Show
          Aakarsh Nair added a comment - We are still seeing this issue even after using Johannes fix. All runners are exiting and the main producer thread hangs on line 196 queue.put . I am thinking it may be because queue is getting drained and filled fast (queue size 50 , number of threads 20) . So there might be a race condition on the queue capacity check.Queue appears to be below capacity to the last runner then fills up by simultaneous calls to put . I still see the issue after backporting what is in 3.x branch for testing it with solr 1.4.1. I guess a solution may be to use larger queue capacities for now but the race conditions still seem to be present.
          Hide
          Yonik Seeley added a comment -

          So there might be a race condition on the queue capacity check.

          Yeah. What about moving the queue.put() inside the synchronized(runners) block to fix this?

          Show
          Yonik Seeley added a comment - So there might be a race condition on the queue capacity check. Yeah. What about moving the queue.put() inside the synchronized(runners) block to fix this?
          Hide
          Yonik Seeley added a comment -

          What about moving the queue.put() inside the synchronized(runners) block to fix this?

          On second thought, that looks like a pretty bad idea
          Looks like a recipe for deadlock since the runners lock will be held if put then blocks.

          Show
          Yonik Seeley added a comment - What about moving the queue.put() inside the synchronized(runners) block to fix this? On second thought, that looks like a pretty bad idea Looks like a recipe for deadlock since the runners lock will be held if put then blocks.
          Hide
          Yonik Seeley added a comment -

          Here's a patch that uses offer instead of put in a retry loop.

          Show
          Yonik Seeley added a comment - Here's a patch that uses offer instead of put in a retry loop.
          Hide
          Yonik Seeley added a comment -

          Committed the latest patch - hopefully that finally fixes this issue!

          Show
          Yonik Seeley added a comment - Committed the latest patch - hopefully that finally fixes this issue!
          Hide
          Christian Goeller added a comment -

          I tried out the comitted patch with SOLR 1.4.1 (built SOLR on my own). Unfortunately I still have an issue when having the following scenario:

          Server: Runs SOLR servlet
          Client: Adds documents to the index using the StreamingUpdateSolrServer in several threads.

          If the server crashes or becomes unreachable for the client, the StreamingUpdateSolrServer on the client will end up in an infinite loop, trying to send requests to the server.
          It should be possible to stop/shutdown/dispose the StreamingUpdateSolrServer somehow from another thread.

          Clearing the queue and stopping the executor service from outside stopps the infinite loop.

          Show
          Christian Goeller added a comment - I tried out the comitted patch with SOLR 1.4.1 (built SOLR on my own). Unfortunately I still have an issue when having the following scenario: Server: Runs SOLR servlet Client: Adds documents to the index using the StreamingUpdateSolrServer in several threads. If the server crashes or becomes unreachable for the client, the StreamingUpdateSolrServer on the client will end up in an infinite loop, trying to send requests to the server. It should be possible to stop/shutdown/dispose the StreamingUpdateSolrServer somehow from another thread. Clearing the queue and stopping the executor service from outside stopps the infinite loop.
          Hide
          Grant Ingersoll added a comment -

          Bulk close for 3.1.0 release

          Show
          Grant Ingersoll added a comment - Bulk close for 3.1.0 release

            People

            • Assignee:
              Yonik Seeley
              Reporter:
              Attila Babo
            • Votes:
              1 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated:
                Original Estimate - 1h
                1h
                Remaining:
                Remaining Estimate - 1h
                1h
                Logged:
                Time Spent - Not Specified
                Not Specified

                  Development