Solr / SOLR-2308

Race condition still exists in StreamingUpdateSolrServer which could cause it to hang

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.1
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: clients - java
    • Labels: None

      Description

      We are still seeing the same issue as SOLR-1711 & SOLR-1885 with Solr 1.4.1.
      We get into this situation when all the runner threads die due to a broken pipe while the BlockingQueue is still full. All of the producer threads are blocked on the BlockingQueue.put() method. Since runners are spawned by the producers, which are all blocked, no runner thread is ever created to drain the queue.
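      The hang can be illustrated in isolation with a bounded BlockingQueue (a hypothetical stand-in for the server's internal request queue, not the actual StreamingUpdateSolrServer code): once the queue is full and no consumer exists to drain it, any put() blocks forever. The demo below uses offer() with a timeout to show the same condition without hanging.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class QueueHangDemo {
    public static void main(String[] args) throws InterruptedException {
        // Hypothetical stand-in for the server's bounded request queue.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(2);
        queue.put("update-1");
        queue.put("update-2");  // queue is now full

        // All runners have died; nothing will ever drain the queue.
        // A producer calling put() here would block indefinitely.
        // offer() with a timeout demonstrates the same state safely:
        boolean accepted = queue.offer("update-3", 100, TimeUnit.MILLISECONDS);
        System.out.println("accepted=" + accepted);
        System.out.println("remainingCapacity=" + queue.remainingCapacity());
    }
}
```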

      Here's a potential fix. In the runner code, replace these lines:

      // remove it from the list of running things...
      synchronized (runners) {
          runners.remove( this );
      }
      

      with these lines:

      // remove it from the list of running things unless we are the last runner and the queue is full...
      synchronized (runners) {
          if (runners.size() == 1 && queue.remainingCapacity() == 0) {
              // keep this runner alive
              scheduler.execute(this);
          } else {
              runners.remove( this );
          }
      }
      
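      As a minimal sketch of the condition in the fix above (hypothetical class and method names, not the actual patch), the exit decision can be factored out and exercised directly: the runner stays alive only when it is the last one and the queue has no remaining capacity.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class RunnerExitDemo {
    // Hypothetical simplification of the patched runner-exit check.
    static boolean shouldKeepRunnerAlive(int runnerCount, BlockingQueue<?> queue) {
        return runnerCount == 1 && queue.remainingCapacity() == 0;
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1);
        queue.offer("pending-update");
        // Last runner + full queue: exiting would strand blocked producers.
        System.out.println(shouldKeepRunnerAlive(1, queue));
        // Queue has space again: producers can enqueue, safe to exit.
        queue.poll();
        System.out.println(shouldKeepRunnerAlive(1, queue));
    }
}
```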

        Issue Links

          Activity

          Johannes created issue -
          Russell Teabeault added a comment - edited

          It seems this condition happens when HttpClient.executeMethod(method), called in the Runner, throws an exception. That causes the runner to be removed from the runners list, and if this happens quickly for all the runners, then callers of the request method block trying to add requests to the blocking queue once it is full.

          Would another possible solution be to add a catch to the try block where HttpClient.executeMethod(method) is called?

          try {
              // ... earlier body of the existing try block elided ...
              int statusCode = getHttpClient().executeMethod(method);
              if (statusCode != HttpStatus.SC_OK) {
                  StringBuilder msg = new StringBuilder();
                  msg.append( method.getStatusLine().getReasonPhrase() );
                  msg.append( "\n\n" );
                  msg.append( method.getStatusText() );
                  msg.append( "\n\n" );
                  msg.append( "request: " + method.getURI() );
                  handleError( new Exception( msg.toString() ) );
              }
          }
          catch (Exception e) {  /** catch added here */
              handleError( e );
          }
          finally {
              try {
                  // make sure to release the connection
                  if (method != null)
                      method.releaseConnection();
              }
              catch (Exception ex) {}
          }
          Yonik Seeley added a comment -

          Oops, I saw Johannes' fix posted in SOLR-1744 and didn't know this issue existed.
          Anyway, this should now be fixed in 3x and trunk.

          Yonik Seeley made changes -
          Status: Open → Resolved
          Fix Version/s: 3.1
          Fix Version/s: 4.0
          Resolution: Fixed
          Grant Ingersoll added a comment -

          Bulk close for 3.1.0 release

          Grant Ingersoll made changes -
          Status: Resolved → Closed
          Jan Høydahl made changes -
          Link: This issue duplicates SOLR-1711
          Jan Høydahl added a comment -

          Linking to the issue that actually got committed.

          Transition           Time In Source Status   Executions   Last Executor      Last Execution Date
          Open → Resolved      23d 2h 26m              1            Yonik Seeley       30/Jan/11 01:48
          Resolved → Closed    59d 13h 57m             1            Grant Ingersoll    30/Mar/11 16:46

            People

            • Assignee: Unassigned
            • Reporter: Johannes
            • Votes: 1
            • Watchers: 2

              Dates

              • Created:
                Updated:
                Resolved:
