Solr
SOLR-906

Buffered / Streaming SolrServer implementation

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4
    • Component/s: clients - java
    • Labels: None

      Description

      While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less than optimal: it makes a new request for each document.

      With a "StreamingHttpSolrServer", documents are buffered and then written to a single open Http connection.

      For related discussion see:
      http://www.nabble.com/solr-performance-tt9055437.html#a20833680
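The buffering idea described above can be sketched with plain java.util.concurrent types. This is an illustrative mock (the class and method names are invented for the sketch, not taken from SolrJ): add() calls land on a bounded queue, and a single background worker drains everything into one open stream.

```java
import java.io.StringWriter;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative mock of the buffered/streaming idea: many add() calls,
// one consumer thread, one open "connection" (a StringWriter here).
public class StreamingSketch {
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);
    private final StringWriter connection = new StringWriter();
    private final Thread worker;

    public StreamingSketch() {
        worker = new Thread(() -> {
            try {
                while (true) {
                    String doc = queue.take();
                    if (doc.equals("EOF")) break;   // poison pill ends the stream (mock only)
                    connection.write("<add><doc>" + doc + "</doc></add>");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
    }

    public void add(String doc) {
        try {
            queue.put(doc);                         // blocks when the buffer is full
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public String close() {
        add("EOF");
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return connection.toString();               // everything went to one stream
    }

    public static void main(String[] args) {
        StreamingSketch s = new StreamingSketch();
        s.add("a");
        s.add("b");
        System.out.println(s.close()); // prints <add><doc>a</doc></add><add><doc>b</doc></add>
    }
}
```

The real patch buffers whole UpdateRequests rather than strings, but the flow is the same: producers block on a bounded queue, and the drained items share one HTTP connection.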

      1. StreamingHttpSolrServer.java (7 kB) – Ryan McKinley
      2. SOLR-906-StreamingHttpSolrServer.patch (12 kB) – Ryan McKinley
      3. SOLR-906-StreamingHttpSolrServer.patch (12 kB) – Ryan McKinley
      4. SOLR-906-StreamingHttpSolrServer.patch (12 kB) – Ryan McKinley
      5. SOLR-906-StreamingHttpSolrServer.patch (13 kB) – Ryan McKinley

        Activity

        Grant Ingersoll added a comment -

        Bulk close for Solr 1.4

        Ryan McKinley added a comment -
        if we modify the XMLLoader to accept multiple <add>

        The XMLLoader already accepts multiple <add> commands without any changes.

        The <stream> tag (or <whatever> tag) is added so that the XML is valid – you cannot have multiple root elements in an XML document.

        Again – nothing has changed with the parser.

        Noble Paul added a comment - - edited

        That is just there so that multiple <add> commands can be in the same XML document.

        It may not be necessary to have the <stream> tag; if we modify the XMLLoader to accept multiple <add>, we can do away with the extra tag.

        The XML format is a public interface and we should be conservative about adding to it.

        It would be nice to document the public changes in the JIRA itself so that other committers can see the changes that are coming in without going through the patch itself.

        Ryan McKinley added a comment -
        But we need to change the global lock to be final.

        why? The global lock is used to block all the threads – each worker checks whether it is null to see if it should block or not.

        The only place that sets the lock is within the same synchronized block:

        
          public synchronized void blockUntilFinished()
          {
            if( lock == null ) {
              lock = new ReentrantLock();
            }
            lock.lock();
        
            ...
        
            lock.unlock();
            lock = null;
          }
        

        since nothing else changes lock, I think it is OK.
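As a hedged reconstruction (not the patch itself), the nullable-lock idiom from the snippet can be exercised in isolation. Because the field is assigned only inside the synchronized method, concurrent callers cannot create two locks; a volatile read is enough for workers peeking at it.

```java
import java.util.concurrent.locks.ReentrantLock;

// Minimal reconstruction of the nullable-lock idiom discussed above.
// The lock field is assigned only inside the synchronized method, so two
// concurrent blockUntilFinished() calls cannot create two locks.
public class LockSketch {
    private volatile ReentrantLock lock = null;

    // Workers just peek at the field to decide whether to block.
    public boolean runnersShouldBlock() {
        return lock != null;
    }

    public synchronized void blockUntilFinished() {
        if (lock == null) {
            lock = new ReentrantLock();
        }
        lock.lock();
        try {
            // ... wait for the queue to empty ...
        } finally {
            lock.unlock();
            lock = null;           // cleared before the next caller observes it
        }
    }
}
```

The try/finally here is an addition over the quoted snippet; it guarantees the field is cleared even if the waiting logic throws.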

        What does the "<stream>" tag do?

        That is just there so that multiple <add> commands can be in the same XML document. It is just an arbitrary parent tag; the parser on the other end only validates once it hits a known command tag.
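The single-root constraint behind the <stream> wrapper can be demonstrated with the JDK's own DOM parser (a standalone sketch, unrelated to Solr's XMLLoader): two sibling <add> elements are only well-formed XML once some wrapper element encloses them.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class RootTagSketch {
    // Returns true when the string parses as well-formed XML.
    public static boolean wellFormed(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;          // parser rejects documents with multiple roots
        }
    }

    public static void main(String[] args) {
        // Two root elements: not well-formed.
        System.out.println(wellFormed("<add><doc/></add><add><doc/></add>"));
        // Any single wrapper tag makes it well-formed.
        System.out.println(wellFormed("<stream><add><doc/></add><add><doc/></add></stream>"));
    }
}
```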

        Shalin Shekhar Mangar added a comment -

        Sorry Ryan for stalling this. The tests run fine. But we need to change the global lock to be final.

        What does the "<stream>" tag do?

        Please go ahead and take charge. It's all yours.

        Noble Paul added a comment -

        Hi Ryan,
        You got me wrong. I was trying to say how to make CommonsHttpSolrServer efficient by streaming docs, as StreamingHttpSolrServer does, when I add docs in bulk using

        SolrServer.add(List<SolrInputDocument> docs)
        

        Yes, StreamingHttpSolrServer uses only one connection per thread, and it closes the connection after waiting 250ms for a new document.

        Ryan McKinley added a comment -

        I would like to go ahead and commit this patch soon. Shalin - did the changes in the latest patch resolve the issues you referred to?

        Ryan McKinley added a comment -

        Are you looking at the patch or just brainstorming how this could be implemented?

        I am referring to the client code. The method in UpdateRequest

        public Collection<ContentStream> getContentStreams() throws IOException {
            return ClientUtils.toContentStreams( getXML(), ClientUtils.TEXT_XML );
        }

        This means that the getXML() method actually constructs a huge String which is the entire XML. It is not very good if we are writing out a very large number of docs.

        This is not how the patch works... for starters, it never calls getContentStreams() for UpdateRequest. It opens a single connection and continually dumps the XML for each request. Rather than calling getXML(), the patch adds a method writeXml( Writer ) that writes directly to the open buffer.

        I am suggesting that CommonsHttpSolrServer has scope for improvement. Instead of building that String in memory we can just start streaming it to the server. The OutputStream can be passed on to UpdateRequest so that it can write the XML right into the stream. So there is streaming effectively on both ends.

        The CommonsHttpSolrServer is fine, but you are right that each UpdateRequest may want to write the content directly to the open stream. The ContentStream interface gives us all that control. One thing to note is that if you do not specify the length, Commons HttpClient will use chunked encoding.

        But I think adding the StreamingUpdateSolrServer resolves that for everyone. Users have either option.

        One drawback of a StreamingHttpSolrServer is that it ends up opening multiple connections for uploading the documents

        Nonsense – that is exactly what this avoids. It opens a single connection and writes everything to it. You can configure how many threads you want emptying the queue; each one will open a connection.

        Another enhancement. We can add one (or more) extra thread in the server to call updateRequestProcessor.processAdd().

        That opens a whole can of worms... perhaps better discussed on java-dev. For now I think sticking to the 1 thread/request model is good. If you want multiple threads running on the server, use multiple connections (it is even an argument to the StreamingHttpSolrServer).

        Noble Paul added a comment -

        Please ignore the number 40K docs. I just took it from your perf test numbers. I thought you were writing docs as a list

        I am referring to the client code. The method in UpdateRequest

        public Collection<ContentStream> getContentStreams() throws IOException {
            return ClientUtils.toContentStreams( getXML(), ClientUtils.TEXT_XML );
        }
        

        This means that the getXML() method actually constructs a huge String which is the entire XML. It is not very good if we are writing out a very large number of docs.

        I am suggesting that CommonsHttpSolrServer has scope for improvement. Instead of building that String in memory we can just start streaming it to the server. The OutputStream can be passed on to UpdateRequest so that it can write the XML right into the stream. So there is streaming effectively on both ends.

        This is valid where users do bulk updates. Not when they write one doc at a time.

        The new method SolrServer#add(Iterator<SolrInputDocument> docs) can start writing the docs immediately, and the docs can be uploaded as and when they are being produced. It is not related to this issue exactly, but the intent of this issue is to make uploads faster.

        SOLR-865 is not very related to this issue. StreamingHttpSolrServer can use javabin format as well.

        with the StreamingHttpSolrServer, you can send documents one at a time and each documents starts sending as soon as it can

        One drawback of a StreamingHttpSolrServer is that it ends up opening multiple connections for uploading the documents

        Another enhancement. We can add one (or more) extra thread in the server to call updateRequestProcessor.processAdd().

        Ryan McKinley added a comment -
        One problem with the current implementation is that it writes everything to a local buffer and then uploads the whole content in one go. So essentially we are wasting time till your 40K docs are written into this huge XML. Another issue is that this XML has to fit in memory. We need to fix the CommonsHttpSolrServer first. It must stream the docs.

        Really?!

        Are you saying that the RequestEntity.html#getContentLength() does not behave as advertised?

        This implementation returns -1 for the content length, and that tells the connection to use chunked encoding to transmit the request entity.

        Where do you get the 40K number? Is it from the log? If so, that is the expected behavior – the server continually processes documents until it reaches the end of the stream. That may be 1 document; that may be 1M docs...

        If you are filling up a Collection<SolrInputDocument> with 40K docs and then sending it, of course it is going to hold on to 40K docs at once.

        We can enhance the SolrServer API by adding a method SolrServer#add(Iterator<SolrInputDocument> docs). So CommonsHttpSolrServer can start writing the documents as and when you are producing them. We also have the advantage of not storing the huge list of docs in memory.

        I'm not following... with the StreamingHttpSolrServer, you can send documents one at a time and each document starts sending as soon as it can. There is a BlockingQueue<UpdateRequest> that holds all UpdateRequests that come through the 'request' method. BlockingQueues only hold a fixed number of items and will block before adding anything beyond the limit.
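The back-pressure behaviour described here is just the BlockingQueue contract; a small stdlib demonstration (independent of SolrJ) where offer() reports a full buffer immediately, while put() would instead block the producer until space frees up.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BackPressureSketch {
    // Offer `offers` items to a queue of the given capacity and count how
    // many are accepted; offer() returns false once the buffer is full,
    // where put() would block the producer instead.
    public static int accepted(int capacity, int offers) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        int taken = 0;
        for (int i = 0; i < offers; i++) {
            if (queue.offer("req" + i)) taken++;
        }
        return taken;
    }

    public static void main(String[] args) {
        System.out.println(accepted(2, 3)); // prints 2: the third offer is refused
    }
}
```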

        Another enhancement is using a different format (SOLR-865). It uses javabin format and it can be extremely fast compared to XML and the payload can be reduced substantially.

        That is a different issue altogether. That relates to having something different running on the server. Once that is in, then this should be able to leverage that as well...

        Noble Paul added a comment - - edited
        • One problem with the current implementation is that it writes everything to a local buffer and then uploads the whole content in one go. So essentially we are wasting time till your 40K docs are written into this huge XML. Another issue is that this XML has to fit in memory. We need to fix the CommonsHttpSolrServer first. It must stream the docs.
        • We can enhance the SolrServer API by adding a method SolrServer#add(Iterator<SolrInputDocument> docs). So CommonsHttpSolrServer can start writing the documents as and when you are producing them. We also have the advantage of not storing the huge list of docs in memory.
        • Another enhancement is using a different format (SOLR-865). It uses javabin format and it can be extremely fast compared to XML and the payload can be reduced substantially.

        Probably we can overcome the perf problems to a certain extent with these fixes.

        Ryan McKinley added a comment -

        Shalin, did you get a chance to look at this version?

        Ryan McKinley added a comment -
        the actual error response (http error code, and the body?)

        That is / can be encoded in the Throwable implementation, no? As is, it adds the same Exception that you get when running the standard one.

        the InputDocument that caused the failure, if there was one

        I don't know if there is a good way to do this. The scope that catches exceptions is way outside of the context where we knew what was written. This is just like hitting an error somewhere in the add( List<Doc> ) – the response has no way to know where it broke. Ideally the error that Solr returns is enough information, but the current exception behavior is to barf or not.

        what else?

        My thought was we could add any of the specialized parameters in finer grained Throwable implementations. When we have a real 'error' response, this could be parsed and passed as an exception.

        Yonik Seeley added a comment -

        public void handleError(Throwable ex)

        Ideally, we could give the user back

        • the actual error response (http error code, and the body?)
        • the InputDocument that caused the failure, if there was one
        • what else?

        So it seems like we should add these parameters to handleError, and just document that they currently return null if we can't get that info yet. Or perhaps more extensible, a SolrError class that has these fields?

        Yonik Seeley added a comment -

        IMO callbacks can be tougher to deal with and require client code to be multi-threaded (like signal handling in C that caused tons of trouble for people because they weren't thinking about the fact that it could be called at any time concurrently with other code, esp back in the days when C library calls were not re-entrant).

        But I guess as you say, polling could be added on top of a callback API if needed later.

        Ryan McKinley added a comment -

        This can be used like so:

        SolrServer solr = new StreamingHttpSolrServer( url, 2, 5 ) {
          @Override
          public void handleError(Throwable ex) {
             // do something...
          }
        };
        
        Ryan McKinley added a comment -

        updated version that includes the callback function:

          public void handleError( Throwable ex )
          {
            log.error( "error", ex );
          }
        

        This gets called whenever an error occurs. We could have it keep a list of any errors, or whatever. I'm not sure what the default error behavior should be – collect everything in a list? Just log?
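One of the options floated here, collecting errors in a list instead of only logging them, could look like this. ErrorCollector and drainErrors() are hypothetical names for the sketch, not part of the patch:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical variant of the handleError callback that records every
// failure so the caller can inspect them after the queue drains.
public class ErrorCollector {
    private final List<Throwable> errors =
        Collections.synchronizedList(new ArrayList<>());

    // Called from the runner threads, so the list must be thread-safe.
    public void handleError(Throwable ex) {
        errors.add(ex);            // instead of (or as well as) logging
    }

    // Hand back everything collected so far and reset.
    public List<Throwable> drainErrors() {
        synchronized (errors) {
            List<Throwable> copy = new ArrayList<>(errors);
            errors.clear();
            return copy;
        }
    }
}
```

A drain-style accessor like this is also what would make the polling API Yonik asks about trivial to layer on top of the callback.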

        Ryan McKinley added a comment -

        I suppose... though I don't see it as a strict requirement – if you need full error handling, use a different SolrServer implementation.

        I think a more reasonable error API would be a callback function rather than polling – the error could occur outside your loop (assuming you break at some point). That callback could easily be converted to a polling API if desired.

        The big thing to note with this API is that calling:
        solr.add( doc )
        just adds it to the queue and processes it in the background. It is a BlockingQueue, so after it hits the max size the client will block before it can add – but that should be transparent to the client.

        The error caused by adding that doc may happen much later in time.

        I'll go ahead and add that callback...

        Yonik Seeley added a comment -

        Isn't there a need for a polling API to check for errors?
        I think something like the following client code would be easiest for people to deal with:

        while not done:
           myDoc = [...] // make the doc
           solr.addDoc(myDoc)
           List<Error> errors = solr.getErrors();
           if (errors != null) handleErrors(errors)
        
        Ryan McKinley added a comment -

        The API is identical to SolrServer;

        rather than instantiating CommonsHttpSolrServer, you just use StreamingHttpSolrServer.

        The constructor args are:
        StreamingHttpSolrServer(String solrServerUrl, int queueSize, int threadCount )

        queueSize is how big you want the buffering queue to be
        threadCount is the maximum number of threads that will get used to empty the queue
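A stand-in sketch of what the two integer arguments control; this mirrors the constructor shape quoted above but is not the SolrJ class:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Stand-in with the same constructor shape as
// StreamingHttpSolrServer(String solrServerUrl, int queueSize, int threadCount);
// illustrative only, not the SolrJ class.
public class StreamingConfigSketch {
    final BlockingQueue<String> queue;   // queueSize: capacity of the buffering queue
    final ExecutorService runners;       // threadCount: threads emptying the queue,
                                         // each holding one open connection

    public StreamingConfigSketch(String solrServerUrl, int queueSize, int threadCount) {
        this.queue = new ArrayBlockingQueue<>(queueSize);
        this.runners = Executors.newFixedThreadPool(threadCount);
    }

    public static void main(String[] args) {
        // Buffer up to 20 requests, drained by 2 background threads.
        StreamingConfigSketch server =
            new StreamingConfigSketch("http://localhost:8983/solr", 20, 2);
        System.out.println(server.queue.remainingCapacity()); // prints 20
        server.runners.shutdown();
    }
}
```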

        Yonik Seeley added a comment -

        I haven't looked at the code, but can someone outline what the API looks like (i.e. what would typical pseudo code that used the API look like).

        Ryan McKinley added a comment -

        Here is an updated patch that buffers UpdateRequests rather than SolrInputDocuments – this is good because then everything is handled in request( final SolrRequest request ), so blocking can be easier.

        Also this lets us submit <add> commands with the commitWithin syntax.

        All /update commands are streamed to the server unless waitSearcher==true

        Shalin – can you check this over for threading issues or general improvements?

        Noble Paul added a comment - - edited

        another observation:
        why do we need a ScheduledExecutorService? We only need a plain ThreadPoolExecutor.
        The name of the class is somewhat misleading. We must document that it is to be used exclusively for updates.

        How about renaming this to StreamingUpdateSolrServer?

        Shalin Shekhar Mangar added a comment -

        why not? All the tests pass for me...

        There are multiple places where SolrExampleTest calls commit without waitSearcher=true and proceeds to query and assert on results. The failure happens intermittently. Try varying the number of threads and you may be able to reproduce the failure.

        The add calls do not come to the process method.

        I meant the request method. Sorry about that. The SolrServer.add() calls the request method but this implementation does not. If there are multiple threads using this class, new documents may get added to the queue before we acquire the lock inside blockUntilFinished due to the call to commit.

        Ryan McKinley added a comment -
        The add calls do not come to the process method. Due to this some add calls may still get in before the commit acquires the lock (assuming multiple producers). Is this class strictly meant for a single document producer use-case?

        I don't totally follow... but if possible, it would be good if multiple threads could fill the same queue. This would let the StreamingHttpSolrServer manage all solr communication

        Ryan McKinley added a comment -
        The SolrExampleTest cannot be used directly

        why not? All the tests pass for me...

        sending a commit() (with waitSearcher=true) should wait for all the docs to get added, then issue the commit, then return.

        Ryan McKinley added a comment -

        Thanks for looking at this!

        The if (req.getCommitWithin() < 0) should be > 0, right?

        no – if a commitWithin time is specified, we cannot use the open request. It needs to start a new request so that a new <add ...> command can be sent. I experimented with sending everything over the open connection, but we would need to add a new parent tag to the xml format. That might not be a bad idea. Then we could send:
        <stream>
        <add>
        <doc...>
        </add>
        <add commitWithin="">
        ...
        <commit />
        <add ...>

        and finally
        </stream>

        Shalin Shekhar Mangar added a comment -

        I've started taking a look at this. A couple of points:

        • Instantiating the lock in blockUntilFinished and nulling it can cause a race condition. A thread in the 'add' method can find that the lock is not null, another thread can null it and the first thread proceeds to lock on it leading to NPE. In the same way, creation of multiple locks is possible in the blockUntilFinished method.
        • The run method calling itself recursively looks suspicious. We may be in danger of overflowing the stack.
        • The SolrExampleTest cannot be used directly because it depends on the order of the commands being executed. We must clearly document that clients should not depend on commands being executed in the order they are given.
        • The if (req.getCommitWithin() < 0) should be > 0, right?
        • The add calls do not come to the process method. Due to this some add calls may still get in before the commit acquires the lock (assuming multiple producers). Is this class strictly meant for a single document producer use-case?
        • The wait loop in blockUntilFinished is very CPU intensive. It can probably be optimized.

        I'm experimenting with a slightly different implementation. Still trying to tie the loose ends. I hope to have a patch soon.
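
The two concurrency problems raised above (a lock object that is created and nulled on the fly, and a CPU-intensive spin loop in blockUntilFinished) can both be avoided with a single final monitor plus wait/notify. This is only an illustrative sketch of that pattern, not code from any attached patch; all names here are hypothetical:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: track in-flight requests with a counter and park waiters on one
// final monitor, instead of a nullable lock object and a busy-wait loop.
public class BlockUntilFinishedSketch {
    private final Object monitor = new Object();
    private final AtomicInteger pending = new AtomicInteger();

    public void taskStarted() { pending.incrementAndGet(); }

    public void taskFinished() {
        if (pending.decrementAndGet() == 0) {
            synchronized (monitor) { monitor.notifyAll(); } // wake any waiters
        }
    }

    public void blockUntilFinished() {
        synchronized (monitor) {
            while (pending.get() > 0) {
                try {
                    monitor.wait(250); // sleeps instead of burning CPU; timed to re-check
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }
}
```

Because the monitor is final, no thread can observe it as null, and the `wait(250)` lets the waiter sleep between re-checks rather than spinning.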

        Ryan McKinley added a comment -

        removes @Override from interfaces

        I guess:
        <property name="java.compat.version" value="1.5" />

        does not actually check everything!

        Shalin Shekhar Mangar added a comment -

        Ryan, I'm seeing compile errors related to @Override with interface methods (that's a Java 6 feature). Also, new IOException( e ) is not defined (also Java 6, I guess).

        Ryan McKinley added a comment -

        The other aspect to note is that StreamingHttpSolrServer sends the request in the background, so after you call add( doc ) or add( list ) that thread is free to keep working. With the off-the-shelf CommonsHttpSolrServer, the client needs to wait for the server to parse the request and index the data before it can continue.

        This switches where the client gets blocked.

        • with CommonsHttpSolrServer it blocks while waiting for solr to write the response
        • with StreamingHttpSolrServer it blocks while waiting for solr to read the request
        Ryan McKinley added a comment -

        > how much of the 3.5minutes -> 30seconds is due to the logging?

        ~1 min. When I turn off logging completely, the time is ~2.5 mins (also, note that with 3 threads, it is down to 20sec)

        RE: calling add( doc ) vs add( List<doc> )...
        yes, things are much better if you call add( List<doc> ). However, it is not the most convenient api if you are running through tons of things.

        I would expect (but have not tried) adding 40K docs in one call to add( List<doc> ) would have the same time as this StreamingHttpSolrServer. It is probably also similar if you buffer 100? 1,000? at a time, but I have not tried.

        The StreamingHttpSolrServer essentially handles the buffering for you. It keeps an http connection open as long as the Queue has docs to send. It can start multiple threads and drain the same Queue simultaneously.

        Essentially, this just offers an easier interface to get the best possible performance. The trade-off (for now) is that there is no good error reporting.
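
The manual alternative being compared here — buffering N docs yourself and then calling add( List<doc> ) — looks roughly like this. The server call is replaced with a counting stub, and the class, batch size, and field names are all illustrative, not SolrJ API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of hand-rolled batching: callers add one doc at a time, and the
// helper flushes a whole batch (the add(List<doc>) style) every BATCH_SIZE
// docs. A stub counter stands in for a real SolrServer.
public class BatchingSketch {
    static final int BATCH_SIZE = 100;
    private final List<String> batch = new ArrayList<>();
    int flushes = 0;   // one flush == one HTTP request in the real thing
    int docsSent = 0;

    public void add(String doc) {
        batch.add(doc);
        if (batch.size() >= BATCH_SIZE) flush();
    }

    public void flush() {
        if (batch.isEmpty()) return;
        // stand-in for server.add(List<SolrInputDocument>)
        docsSent += batch.size();
        flushes++;
        batch.clear();
    }
}
```

The StreamingHttpSolrServer removes the need for this bookkeeping: the queue does the buffering, and the final flush() call (easy to forget here) is replaced by the blocking commit.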

        Ian Holsman added a comment -

        also..

        The way I've always thought CommonsHttpSolrServer was designed to work was to batch up the documents in groups and use the add(List<SolrInputDocument>) function instead of the individual add(SolrInputDocument) function.

        is adding documents in batches (say 100 or 1,000 at a time) also as slow as 3.5m? or won't this work, as you won't get the feedback on which document failed to add if there was an error?

        Ian Holsman added a comment -

        how much of the 3.5minutes -> 30seconds is due to the logging?
        would it be simpler to change the log message to 'DEBUG' instead of 'INFO' and see how the performance of the regular commons server behaves then?

        Ryan McKinley added a comment -

        also, using the recent patch with a Queue size = 20 and thread count=3 (on a dual core machine), the indexing time dropped from 30 secs -> 20 secs. In sum: with the data I am working with, switch from CommonsHttpSolrServer => StreamingHttpSolrServer changes the index time from 3.5 min => 20 sec, or ~10x faster

        Ryan McKinley added a comment -

        Here is an updated version that lets you specify how many threads should work on emptying the Queue. It also adds tests to make sure it passes all the same tests that CommonsHttpSolrServer and EmbeddedSolrServer already pass. That is, it is a drop-in replacement and passes all existing tests.

        One big change is that calling commit or optimize with waitSearcher=true:
        1. blocks adding new docs to the Queue
        2. waits for the Queue to empty (send all docs)
        3. waits for <commit waitSearcher=true /> to return
        4. unblocks everything
        5. finally continues execution.

        My threading chops are not great, so I may be doing something really strange. It would be good to get some more eyes on this!
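
The five-step commit barrier described above can be sketched with a read/write lock: adds take the shared lock, while a waitSearcher commit takes the exclusive lock so it can drain the queue and issue the commit with nothing slipping in behind it. This is a minimal illustration of the idea, assuming a simple string queue; none of these names come from the actual patch:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the commit barrier: producers hold the read lock to enqueue,
// commit holds the write lock to (1) block new adds, (2) drain the queue,
// (3) issue the commit, then (4) unblock everything.
public class CommitBarrierSketch {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final ReentrantReadWriteLock gate = new ReentrantReadWriteLock();
    int committedDocs = -1; // how many docs the last "commit" flushed

    public void add(String doc) {
        gate.readLock().lock(); // blocks while a commit holds the write lock
        try {
            queue.offer(doc);
        } finally {
            gate.readLock().unlock();
        }
    }

    public void commit() {
        gate.writeLock().lock(); // step 1: stop new adds
        try {
            int drained = 0;
            while (queue.poll() != null) drained++; // step 2: send all buffered docs
            committedDocs = drained; // step 3: stand-in for <commit waitSearcher="true"/>
        } finally {
            gate.writeLock().unlock(); // step 4: unblock producers
        }
    }
}
```

Multiple producers can add concurrently (read locks are shared), yet the commit sees a fully drained queue before it returns.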

        Ryan McKinley added a comment -

        The error handling behavior is less than ideal... (as it is for <add> with multiple documents already)

        As is, it will break the connection when an error occurs, but will not know which document caused it. However, it will continue indexing the Queue after an error occurs.

        Ryan McKinley added a comment -

        This implementation buffers documents in a BlockingQueue<SolrInputDocument>;
        when a document is added, it makes sure a Thread is working on sending the queue to solr.

        While this implementation only starts one thread, it would be easy to extend it so that multiple threads are writing to solr simultaneously, draining the same Queue.
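
The buffering idea can be sketched with plain java.util.concurrent types. Strings stand in for SolrInputDocuments and a list stands in for the open HTTP connection; every name below is illustrative, not from the attached StreamingHttpSolrServer.java:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: producers enqueue "documents" into a bounded queue while a runner
// thread drains it, mimicking one long-lived connection streaming docs out.
public class QueueStreamSketch {
    static final BlockingQueue<String> queue = new LinkedBlockingQueue<>(20);
    static final List<String> sent = new ArrayList<>();

    // Runner: keep writing while the queue has docs, like the patch's
    // background thread holding the connection open until the queue is empty.
    static void drain() {
        String doc;
        while ((doc = queue.poll()) != null) {
            synchronized (sent) { sent.add(doc); } // stand-in for the socket write
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) queue.offer("doc" + i); // producer side
        Thread runner = new Thread(QueueStreamSketch::drain);
        runner.start();
        try {
            runner.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        System.out.println("sent=" + sent.size() + " empty=" + queue.isEmpty());
    }
}
```

Scaling to multiple threads is then just starting several runners over the same queue, which is exactly the extension suggested above.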

        Ryan McKinley added a comment -

        One basic problem with calling add( SolrInputDocument ) with the CommonsHttpSolrServer is that it logs a request for each document. This can have a substantial impact. For example, while indexing 40K docs on my machine, it takes ~3 1/2 mins. If I turn logging off, the time drops to ~2 1/2 mins. With the streaming approach, the time drops to 20sec! Some of that is obviously because it limits the logging:

        INFO: {add=[id1,id2,id3,id4, ...(38293 more)]} 0 20714
        

          People

          • Assignee: Ryan McKinley
          • Reporter: Ryan McKinley
          • Votes: 1
          • Watchers: 4
