SOLR-11869: Remote streaming UpdateRequestProcessor

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Do
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      When indexing documents from content management systems (or digital asset management systems), the documents usually carry metadata fields entered by an editor. For PDFs, DOCX files, or other text formats, they may also contain the binary content, which can be parsed to plain text using Tika. This is what the ExtractingRequestHandler currently supports.

      We are now facing situations where we index batches of documents using the UpdateRequestHandler and want to send the binary content of the documents mentioned above as part of a single request to the UpdateRequestHandler. Because those documents may be of unknown size and it is difficult to send streams over the wire with the javax.json APIs, I thought about sending the URL of the document itself and letting Solr fetch the document and have it parsed by Tika, using a RemoteStreamingUpdateRequestProcessor.

      Example:

      {
       "add": { "id": "doc1", "meta": "foo", "meta": "bar", "text": "Short text" },
       "add": { "id": "doc2", "meta": "will become long", "text_ref": "http://..." }
      }
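
      The per-document step such a processor would perform can be sketched roughly as follows. This is only an illustration in Python, not Solr code: the function names, the "_ref" suffix convention (inferred from the "text_ref" field in the example), and the pluggable extract_text callback standing in for Tika are all assumptions.

```python
import urllib.request
from typing import Any, Callable, Dict

# Assumption: a field named "<name>_ref" holds a URL whose content should
# end up in the field "<name>". This convention is inferred from the
# "text_ref" example and is not an existing Solr feature.
REF_SUFFIX = "_ref"


def fetch_url(url: str, timeout: float = 10.0) -> bytes:
    """Hypothetical default fetcher: download the referenced document."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def resolve_remote_fields(
    doc: Dict[str, Any],
    fetch: Callable[[str], bytes] = fetch_url,
    extract_text: Callable[[bytes], str] = lambda raw: raw.decode("utf-8"),
) -> Dict[str, Any]:
    """Return a copy of doc with each "<name>_ref" field replaced by "<name>".

    extract_text is a stand-in for a real parser such as Tika; the default
    simply decodes the bytes as UTF-8.
    """
    resolved: Dict[str, Any] = {}
    for name, value in doc.items():
        if name.endswith(REF_SUFFIX) and len(name) > len(REF_SUFFIX):
            target = name[: -len(REF_SUFFIX)]
            if target in doc:
                # Refuse to guess which value should win.
                raise ValueError(f"both {target!r} and {name!r} are present")
            resolved[target] = extract_text(fetch(value))
        else:
            resolved[name] = value
    return resolved
```

      Passing the fetcher and the extractor as parameters keeps the sketch testable without network access; a real processor would additionally need size limits, timeouts, and restrictions on which URLs may be fetched.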
      

            People

            • Assignee:
              Unassigned
              Reporter:
              diru Dirk Rudolph
            • Votes:
              0
              Watchers:
              3
