Solr / SOLR-2599

CloneFieldUpdateProcessor (copyField-esque equivalent)

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: update
    • Labels:
      None

      Description

      Need an UpdateProcessor which can copy and move fields

      1. SOLR-2599-hoss.patch
        35 kB
        Hoss Man
      2. SOLR-2599.patch
        19 kB
        Jan Høydahl
      3. SOLR-2599.patch
        20 kB
        Jan Høydahl

          Activity

          Jan Høydahl added a comment -

          Here's the processor. It's been in production for some time at a customer.

          Sample config as follows:

          <processor class="solr.FieldCopyProcessorFactory">
            <str name="source">category</str>
            <str name="dest">category_s</str>
          </processor>
          

          To move (rename) a field:

          <processor class="solr.FieldCopyProcessorFactory">
            <str name="source">LastModified</str>
            <str name="dest">last_modified</str>
            <bool name="move">true</bool>
          </processor>
          

          To append to existing field:

          <processor class="solr.FieldCopyProcessorFactory">
            <str name="source">lastname firstname</str>
            <str name="dest">fullname</str>
            <bool name="append">true</bool>
            <str name="append.delim">, </str>
          </processor>
          

          To append as values to multivalued field, with optional size cap:

          <processor class="solr.FieldCopyProcessorFactory">
            <str name="source">title body</str>
            <str name="dest">text</str>
            <bool name="multival">true</bool>
            <int name="maxChars">100</int>
          </processor>
          
          Jan Høydahl added a comment -

          Perhaps multival should be renamed multiValued to comply with schema lingo?

          Also, if I make it (optionally) schema aware, I can set multiValued behavior as the default if the dest field is multivalued. Perhaps it also makes sense to allow append for multiValued, letting it concatenate all source fields into one string and then add that string as a single field value instead of adding each source as its own value?

          The reason I want to be able to disable strict schema checking is the case where a processor creates intermediate fields only, which we know will be removed from the SolrInputDocument before indexing, so that we are free to name them whatever we like without causing an error. Unfortunately, ExtractingRequestHandler is too strict here and would benefit from an enforceSchema=false option.

          Jan Høydahl added a comment -

          New patch. Renamed multival -> multiValued

          Any comments on functionality, naming or conventions before I prepare for commit?

          Jan Høydahl added a comment -

          @Hoss, you have not incorporated this in your SOLR-2802, have you? I'd like to get this in, but have not had time to fully investigate your base classes yet. Can we put this in as is and refactor later? If so, what parameter names should change in order to have the same external API after refactoring?

          Hoss Man added a comment -

          Jan:

          I did not incorporate any sort of copy field equivalent in the SOLR-2802 work, but i did implement the "append" logic as a processor (see below)

          Comments on your patch...

          • my personal pref would be to use a slightly different name... (maybe "CloneFieldUpdateProcessor"?) to help differentiate it from <copyField/> and reduce the likelihood of confusion during casual discussion in email/irc (e.g.: "I'm copying field A to B..."; "wait, are you FieldCopy-ing or CopyField-ing?")
          • as mentioned in SOLR-2825 + SOLR-3095, you shouldn't need to explicitly handle "enabled" in the individual processors
          • i would eliminate the append, append.delim, and multiValued options and only support the multiValued=true behavior - if they want the append logic they can combine this processor with the ConcatFieldUpdateProcessorFactory
          • instead of a "move=true" boolean config, i think it would be more clear what the behavior/alternatives are if we used an "action=clone|rename" config, with the default being "clone"
          • instead of the simple whitespace separated "source" field name config, it would be nice if we could reuse the field name selector syntax options from FieldMutatingUpdateProcessorFactory (multiple fieldName, fieldRegex, typeName, and typeClass as well as excludes of any/all of those)
          • need to think carefully about how maxChars should work:
            • what if the source values aren't Strings? they could easily be numbers or dates, so it seems like a bad idea to convert them to strings just because they are copied/renamed.
            • even if all we worry about is strings, should it be maxChars per value, maxChars per source field, or total maxChars in dest?
              • the specifics need to be documented
            • personally: i would suggest ripping out the maxChars option and making it a distinct processor that can be configured later in the chain. if we leave it in, then i think it's really important that it should be ignored or throw an error unless the value implements CharSequence, and not forcibly toString() every copied value. (so this processor will still be useful with numeric values)
          • need to think carefully about field boosts:
            • either we should try to preserve/combine them on move/copy, or we should make sure we explicitly blow them away
            • either way we need to document it
            • if i'm reading the patch correctly it currently obliterates the boost on the dest field in all cases, even if there are no source values to copy, and ignores any boost on any source field, but we should double check that.
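
          A minimal sketch of what the selector-style "source" suggested above might look like. The selector argument names (fieldRegex, fieldName, exclude) are borrowed from FieldMutatingUpdateProcessorFactory; the class name follows the rename proposed in this comment, and the field names are hypothetical, so treat the whole fragment as an assumption rather than a committed API:

          ```xml
          <!-- Sketch only: clone every field matching a regex, except one, into a
               single multivalued destination. Field names are made up. -->
          <processor class="solr.CloneFieldUpdateProcessorFactory">
            <lst name="source">
              <str name="fieldRegex">.*_price</str>
              <lst name="exclude">
                <str name="fieldName">list_price</str>
              </lst>
            </lst>
            <str name="dest">all_prices</str>
          </processor>
          ```

          Because values are cloned as-is rather than converted with toString(), numeric or date source values would keep their types in the destination field.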
          Hoss Man added a comment -

          Jan: inspired by your patch and tests, i hacked up a new version that incorporates all my previous comments...

          • CloneFieldUpdateProcessorFactory
            • handles just the core field cloning
            • source can be a simple field name, or the various "selector" style args from FieldMutatingUpdateProcessorFactory
          • TruncateFieldUpdateProcessorFactory
            • a FieldMutatingUpdateProcessorFactory
            • implements the 'max chars' style logic
          • IgnoreFieldUpdateProcessorFactory
            • a FieldMutatingUpdateProcessorFactory
            • removes fields from the document

          ...take a look at the javadocs and test case and lemme know what you think. I'm pretty sure combinations of these three processors cover all of the examples from your test case.
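
          As a rough illustration of how these processors might be combined to cover the earlier move, append, and size-cap examples, here is a hedged sketch of an update chain. The chain name is hypothetical, and the parameter names (source, dest, fieldName, delimiter, maxLength) are as I understand them from the javadocs, so verify them against your Solr version:

          ```xml
          <updateRequestProcessorChain name="clone-examples">
            <!-- "move": clone LastModified to last_modified, then drop the source -->
            <processor class="solr.CloneFieldUpdateProcessorFactory">
              <str name="source">LastModified</str>
              <str name="dest">last_modified</str>
            </processor>
            <processor class="solr.IgnoreFieldUpdateProcessorFactory">
              <str name="fieldName">LastModified</str>
            </processor>

            <!-- "append": collect lastname and firstname, then join with ", " -->
            <processor class="solr.CloneFieldUpdateProcessorFactory">
              <arr name="source">
                <str>lastname</str>
                <str>firstname</str>
              </arr>
              <str name="dest">fullname</str>
            </processor>
            <processor class="solr.ConcatFieldUpdateProcessorFactory">
              <str name="fieldName">fullname</str>
              <str name="delimiter">, </str>
            </processor>

            <!-- "size cap": clone title and body into text, then truncate values -->
            <processor class="solr.CloneFieldUpdateProcessorFactory">
              <arr name="source">
                <str>title</str>
                <str>body</str>
              </arr>
              <str name="dest">text</str>
            </processor>
            <processor class="solr.TruncateFieldUpdateProcessorFactory">
              <str name="fieldName">text</str>
              <int name="maxLength">100</int>
            </processor>

            <processor class="solr.RunUpdateProcessorFactory" />
          </updateRequestProcessorChain>
          ```

          Note that in this sketch the truncation applies to each value of text separately, not to the combined length of the field, which is one of the ambiguities raised about the original maxChars option.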

          Hoss Man added a comment -

          I went ahead and committed my patch.

          (one of the beauties of adding more UpdateProcessors like this is that they can be mixed and matched, so if folks have ideas about alternative configuration/behavior we can always add more processors with different names)

          Committed revision 1350050. - trunk
          Committed revision 1350051. - 4x

          Kai Gülzau added a comment -

          Exactly what i was looking for.
          Would be nice if this is documented in http://wiki.apache.org/solr/UpdateRequestProcessor

          Erick Erickson added a comment -

          Anyone can edit the Wiki by creating a logon, could you go ahead and do this?

          Thanks,
          Erick

          Eric Bus added a comment -

          Is there a specific reason why dest has to be a fixed fieldname? I would like to migrate our copyField settings to this 'new' processor, but that would require wildcards for both source and destination. For example:

          <copyField source="text_*" dest="singlestring_*" />

          As far as I can see, this processor does not support this behaviour?
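
          For illustration, a sketch of how a pattern-based destination could be expressed. As far as I recall, later Solr releases allow dest to be a pattern/replacement pair applied to each matching source field name, but that is an assumption on my part and not something this issue added; check the CloneFieldUpdateProcessorFactory javadocs for your version before relying on it:

          ```xml
          <!-- Sketch only: clone each text_XXX field to singlestring_XXX.
               The pattern/replacement form of dest is assumed, not confirmed here. -->
          <processor class="solr.CloneFieldUpdateProcessorFactory">
            <lst name="source">
              <str name="fieldRegex">^text_.*$</str>
            </lst>
            <lst name="dest">
              <str name="pattern">^text_(.*)$</str>
              <str name="replacement">singlestring_$1</str>
            </lst>
          </processor>
          ```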


            People

            • Assignee:
              Jan Høydahl
              Reporter:
              Jan Høydahl
            • Votes:
              2
              Watchers:
              5
