  1. Solr
  2. SOLR-3018

Enhance Solr to support per-document results in batch mode



    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: None
    • Component/s: clients - java
    • Labels: None
    • Environment: any


      It would be useful to have Solr return per-document results instead of a generic SolrException when multiple documents are passed via CommonsHttpSolrServer. The API supports adding multiple streams/files to a request (see SOLR-3010 for an example usage in Jython), but when an error is detected, an exception is returned to the caller, and the caller must then determine which document failed to be processed. This is particularly problematic for simple document extraction when using Solr and Tika to pre-process documents for indexing. In this case, a batch of documents is passed to Solr for processing by Tika. If any of the documents fails to be processed, a SolrException is thrown:

      Mon Jan  9 18:04:50 2012 Caught SolrException handling documents [13356414, 23590833, 33917483] (<jclass org.apache.solr.common.SolrException 9>, org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.TNEFParser@6d893ae8  org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIK

      Instead of throwing this exception, the API could be configured to return a response containing one result per document, indicating how the server handled each document in the batch. A caller could then check the response, extract the parsed content for the documents that succeeded, and apply special handling to the documents that failed to parse.

      There are reasonable workarounds for this in the current product. First, callers can pass one document at a time for processing, so there is no ambiguity about the result for each document. Alternatively, callers can pass a small batch of documents to Solr/Tika and, if an exception is thrown, reprocess the documents in that batch one at a time. If the corpus of documents is largely well-behaved, few retries will be needed.
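
      The second workaround can be sketched on the caller side roughly as follows. This is only an illustration under stated assumptions, not Solr or SolrJ API: `process_with_fallback` and the `submit_batch` callable are hypothetical names, and `submit_batch` stands in for whatever client call sends a batch to Solr/Tika and raises when any document fails. The sketch also assumes that resubmitting a document is harmless, which holds for extraction-only requests.

```python
from typing import Callable, Dict, Optional, Sequence


def process_with_fallback(
    doc_ids: Sequence[str],
    submit_batch: Callable[[Sequence[str]], None],  # hypothetical client call
    batch_size: int = 10,
) -> Dict[str, Optional[Exception]]:
    """Map each document id to None on success or to the exception it raised."""
    results: Dict[str, Optional[Exception]] = {}
    for start in range(0, len(doc_ids), batch_size):
        batch = doc_ids[start:start + batch_size]
        try:
            # Optimistic path: send the whole batch in one request.
            submit_batch(batch)
        except Exception:
            # The batch failed and we cannot tell which document caused it,
            # so resubmit each document on its own to isolate the failures.
            for doc_id in batch:
                try:
                    submit_batch([doc_id])
                    results[doc_id] = None
                except Exception as exc:
                    results[doc_id] = exc
        else:
            for doc_id in batch:
                results[doc_id] = None
    return results
```

      The returned mapping gives the caller the per-document view this issue asks for, at the cost of one extra request per document in every failed batch. Smaller batches reduce that retry cost but increase the number of requests on the success path.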


              Assignee: Unassigned
              Reporter: Rob Tulloh (rtulloh)