Solr / SOLR-1536

Support for TokenFilters that may modify input documents

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.5
    • Fix Version/s: None
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      In some scenarios it's useful to be able to create or modify fields in the input document based on analysis of other fields of the same document. This need arises, for example, when indexing multilingual documents or when doing NLP processing such as named entity recognition (NER). However, this is currently not possible.

      This issue provides an implementation of this functionality that consists of the following parts:

      • DocumentAlteringFilterFactory - abstract superclass that indicates that TokenFilter-s created from this factory may modify fields in a SolrInputDocument.
      • TypeAsFieldFilterFactory - example implementation that illustrates this concept, with a JUnit test.
      • DocumentBuilder modifications to support this functionality.

      Attachments

      1. altering.patch (26 kB, Andrzej Bialecki)
      2. altering.patch (26 kB, Andrzej Bialecki)
      3. altering.patch (26 kB, Andrzej Bialecki)

          Activity

          Otis Gospodnetic added a comment -

          Is this better than writing a custom UpdateRequestProcessor that takes the value of the incoming SolrInputDocument (SID), does something to it, removes the original field, and adds the modified version back to SID?
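
A minimal sketch of the alternative described in this comment: read a field value, remove the original field, and add a modified version back. It deliberately does not use Solr's classes. The document is modeled as a plain map, and `FieldRewriteSketch`, `rewriteField`, and the trim-and-lowercase transformation are hypothetical stand-ins chosen for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative stand-in for an update processor; this is NOT Solr's UpdateRequestProcessor API. */
final class FieldRewriteSketch {

    /**
     * Returns a copy of the document in which the given field has been
     * removed and re-added with a modified value. The input map (a
     * hypothetical stand-in for SolrInputDocument) is left untouched.
     */
    static Map<String, Object> rewriteField(Map<String, Object> doc, String field) {
        Map<String, Object> out = new LinkedHashMap<>(doc);
        Object original = out.remove(field);   // remove the original field
        if (original != null) {
            // "Does something to it": trim and lowercase as a placeholder transformation.
            String modified = original.toString().trim().toLowerCase(Locale.ROOT);
            out.put(field, modified);          // add the modified version back
        }
        return out;
    }
}
```

In a real deployment the same three steps would live inside a custom update processor, which is the coding effort the following comments weigh against a configuration-only approach.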

          Andrzej Bialecki added a comment -

          My opinion may be biased, but I'll try to be as objective as I can. I think it's better because it gives you much more flexibility in building analysis and indexing chains without coding. If we went with an UpdateRequestProcessor, you would have to implement a new one whenever your analysis chain changes. With the approach in this patch it's just a matter of configuration, rather than implementing as many custom update processors as there are possible combinations.

          Mike Perham added a comment -

          This would be hugely useful for us in implementing a profanity detector. We'd like to scan the 'content' field for profane tokens and mark a boolean 'safe' field with the results.
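
The use case above can be sketched roughly as follows, assuming a very crude tokenizer in place of a real analysis chain. Nothing here is Solr API: the map-based document, `ProfanityFlagSketch`, `tokenize`, and `markSafe` are hypothetical names, and only the field names 'content' and 'safe' come from the comment.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Illustrative only: scans one field's tokens and records the result in another field. */
final class ProfanityFlagSketch {

    /** Crude stand-in for an analyzer: lowercase, then split on anything that is not a letter or digit. */
    static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
    }

    /**
     * Returns a copy of the document with a boolean 'safe' field that is
     * false if any token of 'content' appears in the blocked set.
     * A document without a 'content' field is treated as safe (an assumption).
     */
    static Map<String, Object> markSafe(Map<String, Object> doc, Set<String> blocked) {
        Map<String, Object> out = new LinkedHashMap<>(doc);
        Object content = out.get("content");
        boolean safe = true;
        if (content != null) {
            for (String token : tokenize(content.toString())) {
                if (blocked.contains(token)) {
                    safe = false;
                    break;
                }
            }
        }
        out.put("safe", safe);
        return out;
    }
}
```

The point of the sketch is that the flag depends on the analyzed tokens rather than the raw text, which is why access to analysis output matters for this use case.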

          Jan Høydahl added a comment -

          In my head document-level modifications belong in UpdateRequestProcessors. You always have SOLR-1725 to script those quickly, and configuring a chain is easily done in XML (http://wiki.apache.org/solr/SolrConfigXml#UpdateRequestProcessorChain_section).

          The trouble comes when you need to act on an analyzed version of a field, say, to match terms against a normalized dictionary. To allow this, could we let analysis run anywhere in the update chain? That way we could put UpdateRequestProcessors after analysis as well:

          <updateRequestProcessorChain name="test">
              <processor class="org.apache.solr.update.processor.MyPreProcessorFactory" />
              <analysis />
              <processor class="org.apache.solr.update.processor.MyPostProcessorFactory" />
          </updateRequestProcessorChain>
          

          With <analysis/> optional, the default position would be at the end of the chain, as today. I have no idea how easy such a change would be with the current architecture.
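
The ordering idea can be modeled in a few lines of plain Java. This is only a sketch of the proposal, not of anything that exists in Solr: `AnalysisInChainSketch`, `Context`, `ANALYSIS`, and the whitespace tokenizer are all hypothetical. Steps run in order, an explicit analysis step makes tokens visible to later steps, and if no analysis step is listed it is appended at the end.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Consumer;

/** Illustrative model of a processing chain in which analysis may sit between processors. */
final class AnalysisInChainSketch {

    /** State passed along the chain: raw field values plus tokens produced by analysis. */
    static final class Context {
        final Map<String, Object> fields = new HashMap<>();
        final Map<String, List<String>> tokens = new HashMap<>();
    }

    /** Hypothetical counterpart of the <analysis/> element: lowercases and splits string fields on whitespace. */
    static final Consumer<Context> ANALYSIS = ctx -> {
        for (Map.Entry<String, Object> e : ctx.fields.entrySet()) {
            if (e.getValue() instanceof String) {
                List<String> toks = new ArrayList<>();
                for (String t : ((String) e.getValue()).toLowerCase(Locale.ROOT).split("\\s+")) {
                    if (!t.isEmpty()) {
                        toks.add(t);
                    }
                }
                ctx.tokens.put(e.getKey(), toks);
            }
        }
    };

    /** If the configured steps do not mention analysis, run it last, mirroring the proposed default. */
    static List<Consumer<Context>> withDefaultAnalysis(List<Consumer<Context>> steps) {
        List<Consumer<Context>> chain = new ArrayList<>(steps);
        if (!chain.contains(ANALYSIS)) {
            chain.add(ANALYSIS);
        }
        return chain;
    }

    /** Runs every step in order over a fresh context seeded with the document's fields. */
    static Context run(List<Consumer<Context>> steps, Map<String, Object> doc) {
        Context ctx = new Context();
        ctx.fields.putAll(doc);
        for (Consumer<Context> step : withDefaultAnalysis(steps)) {
            step.accept(ctx);
        }
        return ctx;
    }
}
```

A processor placed after `ANALYSIS` can read `ctx.tokens`; the same processor in a chain without an explicit analysis step sees no tokens, because analysis then runs only at the end.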

          Mike Perham added a comment -

          Another developer just mentioned that I might be able to use term frequency vectors (TFVs) to implement the profanity detector. We've got termVectors="true" on the content field since we are also using MoreLikeThis. If I can get access to the field's TFV in the UpdateRequestProcessor (URP), I could just run through the profanities, checking for each one in the TFV. I'm not sure whether this is possible; I need to check the javadocs.

          Andrzej Bialecki added a comment -

          Term frequency vectors are not available at this stage, unless you go to the expense of creating a MemoryIndex. I think the solution I proposed is less costly and more generic.

          Andrzej Bialecki added a comment -

          Patch updated to trunk.

          Andrzej Bialecki added a comment -

          Updated patch - previous patch produced NPEs.

          Andrzej Bialecki added a comment -

          Patch updated to trunk.

          Bill Bell added a comment -

          This fails on the latest trunk.

          Andrzej Bialecki added a comment -

          Resolving as Won't Fix - the complications to DocumentBuilder don't seem worth it. It's probably better to implement this as an UpdateRequestProcessor.

          Kai Gülzau added a comment -

          Is there a follow-up ticket for Jan Høydahl's idea of placing the analysis phase in the middle of the updateRequestProcessorChain?
          This would solve my problem: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201302.mbox/%3CB65DA877C3F93B4FB39EA49A1A03C95CC30173%40email.novomind.com%3E


            People

            • Assignee: Unassigned
            • Reporter: Andrzej Bialecki
            • Votes: 4
            • Watchers: 6
