Solr / SOLR-1763

Integrate Solr Cell/Tika as an UpdateRequestProcessor

    Details

      Description

      From Chris Hostetter's original post in solr-dev:

      As someone with very little knowledge of Solr Cell and/or Tika, I find myself wondering if ExtractingRequestHandler would make more sense as an extractingUpdateProcessor – where it could be configured to take either binary fields (or string fields containing URLs) out of the Documents, parse them with Tika, and add the various XPath matching hunks of text back into the document as new fields.

      Then ExtractingRequestHandler just becomes a handler that slurps up its ContentStreams, adds them as binary data fields, and adds the other literal params as fields.

      Wouldn't that make things like SOLR-1358, and using Tika with URLs/filepaths in XML and CSV based updates fairly trivial?

      -Hoss

      I couldn't agree more, so I decided to add it as an issue.
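
      To make the proposed split concrete, here is a minimal sketch of the "thin handler" half in plain Java. It is only an illustration under stated assumptions: the class name RawHandlerSketch, the method toDocument, and the field name raw_content are hypothetical, and a plain Map stands in for a Solr document. The only detail borrowed from the existing handler is the "literal." parameter prefix it uses for literal field values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RawHandlerSketch {

    // Prefix used for request parameters that should become document fields.
    static final String LITERAL_PREFIX = "literal.";

    /**
     * Hypothetical helper: wraps the raw request body in a binary field and
     * copies every "literal.*" parameter into the document as a plain field.
     * Parsing is deliberately left to a later step in the update chain.
     */
    public static Map<String, Object> toDocument(byte[] body, Map<String, String> params) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("raw_content", body); // hypothetical field name
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (e.getKey().startsWith(LITERAL_PREFIX)) {
                String field = e.getKey().substring(LITERAL_PREFIX.length());
                doc.put(field, e.getValue());
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("literal.id", "doc1");
        params.put("commit", "true"); // not a literal, so it is not copied
        Map<String, Object> doc = toDocument(new byte[] {0x25, 0x50, 0x44, 0x46}, params);
        System.out.println(doc.keySet());
    }
}
```

      The handler does no extraction at all here; whatever processor runs later would find the bytes in the binary field and replace them with extracted text.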

        Issue Links

          Activity

          Jan Høydahl added a comment -

          Re-posting my comment from solr-dev in this ticket:
          Good match. UpdateProcessors are the way to go for functionality that modifies documents prior to indexing.
          With this, we can mix and match any type of content source with other processing needs.

          I think it can be beneficial to have the choice to do extraction on the SolrJ side. But you don't always have that choice: if your source is a crawler without built-in Tika, a base64-encoded field in an XML document, or some other random source, you want to do the extraction at an arbitrary place in the chain.

          Examples:
          Crawler (httpheaders, binarybody) -> TikaUpdateProcessor (+title, +text, +meta...) -> index
          XML (title, pdfurl) -> GetUrlProcessor (+pdfbin) -> TikaUpdateProcessor (+text, +meta) -> index
          DIH (city, street, lat, lon) -> LatLon2GeoHashProcessor (+geohash) -> index
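
          The second chain above can be sketched in plain Java as follows. This is an illustration only: the Processor interface and both processor classes are hypothetical stand-ins, not Solr's UpdateRequestProcessor API, and neither step does real work. The "fetch" step stores placeholder bytes instead of downloading anything, and the "Tika" step just decodes bytes as UTF-8 instead of parsing a PDF.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ChainSketch {

    /** Hypothetical stand-in for an update processor: mutates the document in place. */
    public interface Processor {
        void process(Map<String, Object> doc);
    }

    /** Hypothetical: pretends to fetch the URL in "pdfurl" and stores placeholder bytes in "pdfbin". */
    public static class FakeGetUrlProcessor implements Processor {
        @Override
        public void process(Map<String, Object> doc) {
            Object url = doc.get("pdfurl");
            if (url != null) {
                doc.put("pdfbin", ("content of " + url).getBytes(StandardCharsets.UTF_8));
            }
        }
    }

    /** Hypothetical: stands in for Tika by decoding the binary field as UTF-8 text. */
    public static class FakeTikaProcessor implements Processor {
        private final String sourceField;

        public FakeTikaProcessor(String sourceField) {
            this.sourceField = sourceField;
        }

        @Override
        public void process(Map<String, Object> doc) {
            Object raw = doc.remove(sourceField); // take the binary field out of the document
            if (raw instanceof byte[]) {
                doc.put("text", new String((byte[]) raw, StandardCharsets.UTF_8));
            }
        }
    }

    /** Runs each processor in order over the same document. */
    public static Map<String, Object> run(List<Processor> chain, Map<String, Object> doc) {
        for (Processor p : chain) {
            p.process(doc);
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("title", "Annual report");
        doc.put("pdfurl", "http://example.com/report.pdf");
        List<Processor> chain =
                Arrays.asList(new FakeGetUrlProcessor(), new FakeTikaProcessor("pdfbin"));
        System.out.println(run(chain, doc));
    }
}
```

          Because each step only reads and writes fields, the extraction step does not care whether the bytes came from a crawler, a URL fetch, or a decoded XML field.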

          I propose to model the document processor chain more after FAST ESP's flexible processing chain, which must be seen as an industry best practice. I'm thinking of starting a Wiki page to model what direction we should go.


          Jan Høydahl - search architect
          Cominvent AS - www.cominvent.com

          Jan Høydahl added a comment -

          I may have a need for this functionality in an upcoming project. Can anyone who knows the code estimate the effort?

          Jan Høydahl added a comment -

          Starting to look into this one. Will it make most sense to make the patch against contrib/extraction since it depends on the Tika jars?

          Hoss Man added a comment -

          Will it make most sense to make the patch against contrib/extraction since it depends on the Tika jars?

          That would be my suggestion ... an ExtractionUpdateProcessor sitting right next to the ExtractingRequestHandler.

          Lance Norskog added a comment - edited

          Can the ExtractingRequestHandler go away?

          Jan Høydahl added a comment -

          Ideally the UpdateProcessor will do everything that the RequestHandler does and more.
          We might still need a RequestHandler which is capable of accepting a binary file as input, as well as conveying certain request parameters to the UpdateProcessor.
          But that should probably be a new thinner "RawUpdateRequestHandler".

          When this more generic architecture has proven itself superior, then we can start deprecating old stuff. DIH should then also start looking to the UpdateProcessor for its Tika needs.

          Jan Høydahl added a comment -

          I believe these issues are related in that they attempt to introduce Tika extraction of some input content and output the extracted text to various fields. They should share a code base if possible.

          Jan Høydahl added a comment -

          I won't have time to look at this before October-ish, so anyone feel free to give it a shot.

          Jan Høydahl added a comment -

          Anyone interested in this feature?

          Jan Høydahl added a comment -

          Testing bulk


            People

            • Assignee:
              Unassigned
              Reporter:
              Jan Høydahl
            • Votes:
              0
              Watchers:
              2
