Solr
  1. Solr
  2. SOLR-8113

Accept replacement strings in CloneFieldUpdateProcessorFactory

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 5.3
    • Fix Version/s: 5.4, 6.0
    • Component/s: update
    • Labels:
      None
    • Flags:
      Patch

      Description

      Presently CloneFieldUpdateProcessorFactory accepts regular expressions to select source fields, which mirrors wildcards in the source for copyField in the schema. This patch adds a counterpart to copyField's wildcards in the dest attribute by interpreting the dest parameter as a regex replacement string.

      1. SOLR-8113.patch
        50 kB
        Hoss Man
      2. SOLR-8113.patch
        49 kB
        Gus Heck
      3. SOLR-8113.patch
        48 kB
        Hoss Man
      4. SOLR-8113.patch
        35 kB
        Gus Heck
      5. SOLR-8113.patch
        9 kB
        Gus Heck

        Issue Links

          Activity

          Hide
          Gus Heck added a comment -

          Patch vs 5x

          Show
          Gus Heck added a comment - Patch vs 5x
          Hide
          Gus Heck added a comment - - edited

          This was motivated by the response to SOLR-8109. This bridges the gap in the suggested alternate solution, but does look like it will likely be a lot less performant. I couldn't find a way to re-use the match done by the FieldNameSelector, so the regex matches have to execute twice. All attempts I made to do that resulted in major API changes.

          This is however more flexible, will not interfere with atomic updates (as copyField apparently does) and Allows further movement away from the use of copyField.

          It would be nice if both this and SOLR-8109 to become available since speed is one tradeoff, and atomic update support etc is another.

          Show
          Gus Heck added a comment - - edited This was motivated by the response to SOLR-8109 . This bridges the gap in the suggested alternate solution, but does look like it will likely be a lot less performant. I couldn't find a way to re-use the match done by the FieldNameSelector, so the regex matches have to execute twice. All attempts I made to do that resulted in major API changes. This is however more flexible, will not interfere with atomic updates (as copyField apparently does) and Allows further movement away from the use of copyField. It would be nice if both this and SOLR-8109 to become available since speed is one tradeoff, and atomic update support etc is another.
          Hide
          Hoss Man added a comment -

          Gus, just read through your patch.

          My chief concerns are:

          1. you've redefined the semantics of how the dest string is interpreted when a fieldRegex is used to identify the source (so there's a back compat break there depending on the value of dest)
          2. You've designed the "config syntax" for this new feature around the requirement that it can only be used if at least one fieldRegex is used to identify the source fields ...

          The original purpose of the FieldSelector API was to provide more general appoaches for configuring which fields and UpdateProcessor should care about beyond simple string field name glob/pattern matching. I think that pattern replacements for destination field naming should (in general) be independent of the original selection criteria, so that a user could say something like...

          I want to make a copy of any StrField in my documents such that the copy has the same name as the original but with _t appended.

          ...and that shold be possible with this feature, regardless of wether the user is using an specific naming convention (ie "*_s") for all StrFields in their index, using some syntax that might look like this...

          <processor class="solr.CloneFieldUpdateProcessorFactory">
            <!-- existing source selector syntax -->
            <lst name="source">
              <str name="typeClass">solr.StrField</str>
            </lst>
            <!-- hypothetical new destination pattern syntax -->
            <lst name="dest">
              <str name="pattern">.*</str>
              <str name="replacement">$0_t</str>
            </lst>
          </processor>
          

          ...while prefix->prefix and suffix->suffix style of cloning similar to what copyField supports could also be specified. Example: a <copyField src="*_s" dest="*_t" /> equivilent would be...

          <processor class="solr.CloneFieldUpdateProcessorFactory">
            <!-- existing source selector syntax -->
            <lst name="source">
              <str name="fieldRegex">^(.*)_s$</str>
            </lst>
            <!-- hypothetical new destination pattern syntax -->
            <lst name="dest">
              <str name="pattern">^(.*)_s$</str>
              <str name="replacement">$1_t</str>
            </lst>
          </processor>
          

          That's fairly verbose, but if we get the nuts & blots of the general case implemented, then it should be trivial to add syntactic sugar to simplify the configuration...

          <processor class="solr.CloneFieldUpdateProcessorFactory">
            <!-- hypothetical syntactic sugar equivilent to the above example -->
            <!-- since no other source selector args are specified, assume pattern based cloning -->
            <str name="pattern">^(.*)_s$</str>
            <str name="replacement">$1_t</str>
          </processor>
          

          What do you think?

          Show
          Hoss Man added a comment - Gus, just read through your patch. My chief concerns are: you've redefined the semantics of how the dest string is interpreted when a fieldRegex is used to identify the source (so there's a back compat break there depending on the value of dest ) You've designed the "config syntax" for this new feature around the requirement that it can only be used if at least one fieldRegex is used to identify the source fields ... The original purpose of the FieldSelector API was to provide more general appoaches for configuring which fields and UpdateProcessor should care about beyond simple string field name glob/pattern matching. I think that pattern replacements for destination field naming should (in general) be independent of the original selection criteria, so that a user could say something like... I want to make a copy of any StrField in my documents such that the copy has the same name as the original but with _t appended. ...and that shold be possible with this feature, regardless of wether the user is using an specific naming convention (ie "*_s") for all StrFields in their index, using some syntax that might look like this... <processor class= "solr.CloneFieldUpdateProcessorFactory" > <!-- existing source selector syntax --> <lst name= "source" > <str name= "typeClass" >solr.StrField</str> </lst> <!-- hypothetical new destination pattern syntax --> <lst name= "dest" > <str name= "pattern" >.*</str> <str name= "replacement" >$0_t</str> </lst> </processor> ...while prefix->prefix and suffix->suffix style of cloning similar to what copyField supports could also be specified. Example: a <copyField src="*_s" dest="*_t" /> equivilent would be... <processor class= "solr.CloneFieldUpdateProcessorFactory" > <!-- existing source selector syntax --> <lst name= "source" > <str name= "fieldRegex" >^(.*)_s$</str> </lst> <!-- hypothetical new destination pattern syntax --> <lst name= "dest" > <str name= "pattern" >^(.*)_s$</str> <str name= "replacement" >$1_t</str> </lst> </processor> That's fairly verbose, but if we get the nuts & blots of the general case implemented, then it should be trivial to add syntactic sugar to simplify the configuration... <processor class= "solr.CloneFieldUpdateProcessorFactory" > <!-- hypothetical syntactic sugar equivilent to the above example --> <!-- since no other source selector args are specified, assume pattern based cloning --> <str name= "pattern" >^(.*)_s$</str> <str name= "replacement" >$1_t</str> </processor> What do you think?
          Hide
          Gus Heck added a comment -

          Thanks for the review. You make some good points. I'd probably never want to ask the user to repeat patterns that had to match (or want to have to write validation around that)... nobody will want to type the regex twice, it could only lead to a mistake. Maybe it's as simple as using "replacement" rather than dest, and documenting/validating that only one or the other should be supplied. If neither is supplied default to ^(.*)$ for the fieldRegex? This results in something like:

          <processor class="solr.CloneFieldUpdateProcessorFactory">
            <lst name="source">
              <str name="typeClass">solr.StrField</str>
            </lst>
            <str name="replacement">$1_t</str> <!-- relies on default ^(.*)$ fieldRegex -->
          </processor>

          for your first example. In the event they do want to specify a particular regex...

          <processor class="solr.CloneFieldUpdateProcessorFactory">
            <lst name="source">
              <str name="typeClass">solr.StrField</str>
              <str name="fieldRegex">^(.*)_s$</str>
            </lst>
            <str name="replacement">$1_t</str> 
          </processor>

          and thus your second example (with no string type) looks like this:

          <processor class="solr.CloneFieldUpdateProcessorFactory">
            <lst name="source">
              <str name="fieldRegex">^(.*)_s$</str>
            </lst>
            <str name="replacement">$1_t</str>
          </processor>

          This also seems to avoid the user needing to think about whether or not they should use $0 or $1. (except in some sort of funky exotic cases where they might be using both $0 and $1, $2, etc... which will involve them supplying a pattern anyway and should still work).

          Show
          Gus Heck added a comment - Thanks for the review. You make some good points. I'd probably never want to ask the user to repeat patterns that had to match (or want to have to write validation around that)... nobody will want to type the regex twice, it could only lead to a mistake. Maybe it's as simple as using "replacement" rather than dest, and documenting/validating that only one or the other should be supplied. If neither is supplied default to ^(.*)$ for the fieldRegex? This results in something like: <processor class= "solr.CloneFieldUpdateProcessorFactory" > <lst name= "source" > <str name= "typeClass" >solr.StrField</str> </lst> <str name= "replacement" >$1_t</str> <!-- relies on default ^(.*)$ fieldRegex --> </processor> for your first example. In the event they do want to specify a particular regex... <processor class= "solr.CloneFieldUpdateProcessorFactory" > <lst name= "source" > <str name= "typeClass" >solr.StrField</str> <str name= "fieldRegex" >^(.*)_s$</str> </lst> <str name= "replacement" >$1_t</str> </processor> and thus your second example (with no string type) looks like this: <processor class= "solr.CloneFieldUpdateProcessorFactory" > <lst name= "source" > <str name= "fieldRegex" >^(.*)_s$</str> </lst> <str name= "replacement" >$1_t</str> </processor> This also seems to avoid the user needing to think about whether or not they should use $0 or $1. (except in some sort of funky exotic cases where they might be using both $0 and $1, $2, etc... which will involve them supplying a pattern anyway and should still work).
          Hide
          Gus Heck added a comment -

          After chatting briefly with Hoss Man at Lucene/Solr Revolution it became clear that the key point of his suggestion was the complete separation of the selection phase and the replacement phase. The attached patch provides his suggested configuration options. This does introduce the possibility that a field could match the selector and not match the replacement pattern. I have handled this case by ignoring such fields (as if they were not selected) and logging a debug message. I also wound up creating my own entire unit test before I found the existing tests in FieldMutatingProcessorFactoryTest. I have moved the tests from there into my test class (CloneFieldUpdateFactoryTest) as well so that they are easier to find. Both tests read the same config file.

          Show
          Gus Heck added a comment - After chatting briefly with Hoss Man at Lucene/Solr Revolution it became clear that the key point of his suggestion was the complete separation of the selection phase and the replacement phase. The attached patch provides his suggested configuration options. This does introduce the possibility that a field could match the selector and not match the replacement pattern. I have handled this case by ignoring such fields (as if they were not selected) and logging a debug message. I also wound up creating my own entire unit test before I found the existing tests in FieldMutatingProcessorFactoryTest. I have moved the tests from there into my test class (CloneFieldUpdateFactoryTest) as well so that they are easier to find. Both tests read the same config file.
          Hide
          Gus Heck added a comment -

          Any thoughts on my latest patch Hoss Man? Others? comments welcome.

          Show
          Gus Heck added a comment - Any thoughts on my latest patch Hoss Man ? Others? comments welcome.
          Hide
          Hoss Man added a comment -

          Gus: I've been out sick most of this week, and am now way behind on a bunch of stuff – but this issue is on my radar, and I will try to review ASAP.

          Show
          Hoss Man added a comment - Gus: I've been out sick most of this week, and am now way behind on a bunch of stuff – but this issue is on my radar, and I will try to review ASAP.
          Hide
          Hoss Man added a comment -

          Gus: This is definitely along the lines of what I had in mind – and i like your test refactoring / additions.

          I've updated the patch as I reviewed – mainly in the ariea of documentation and additional error handling/messages when parsing the config...

          • removed stray addition to FirstFieldValueUpdateProcessorFactory javadocs ... left over from old patch?
          • CloneFieldUpdateProcessorFactory javadocs:
            • fixed <pattern> to <str name="pattern"> (and likewise for replacement
            • clarify what the the literals & types were for the new config in the description
            • reworded example description to be bulleted list instead of run on sentence(s)
            • moved/reworded "common case" explanation to the end, after all the major functionality is explained, to clarify it's syntactic sugar and put next to it's example.
            • replaced the one off comment about FirstValueUpdateProcessor with a more general comment about cloning into multivalue fields and various FieldValueSubsetUpdateProcessorFactory options
          • CloneFieldUpdateProcessorFactoryTest
            • standardized indenting
            • updated testCloneFieldExample to include the additions made to the javadoc example
            • updated testCloneCombinations to include a first value clone test
              • added corrisponding clone-first to solrconfig-update-processor-chains.xml
            • updated testCloneField with more equivilence tests – this helps ensure we've got good coverage for the case where multi-selector + pattern + replacement that results in single dest field getting values from multiple source fields.
              • added corrisponding new clone-single-regex, clone-multi-regex, clone-array-regex, clone-selector-regex and clone-simple-regex-syntax to solrconfig-update-processor-chains.xml as needed
          • CloneFieldUpdateProcessorFactory code:
            • fixed getInstance to use getSourceSelector for safe error handling (eliminates the unused warning you asked about)
            • refactored init method into 2 helper methods specific to the two syntax styles (readability)
            • tweaked config error messages & added additional error handling of bad config combos
            • improved error reporting when pattern is invalid and can't be compile

          My one remaining concern with this patch is the use of Matcher.replaceFirst ... i feel like we should probably be using Matcher.replaceAll since it would provide a feature superset of replaceFirst (ie: using replaceAll can still support all the current patch behavior via start/end bound constraints + capture groups, but replaceFirst can't support everything possible with replaceAll).

          Gus: What do you think?

          Show
          Hoss Man added a comment - Gus: This is definitely along the lines of what I had in mind – and i like your test refactoring / additions. I've updated the patch as I reviewed – mainly in the ariea of documentation and additional error handling/messages when parsing the config... removed stray addition to FirstFieldValueUpdateProcessorFactory javadocs ... left over from old patch? CloneFieldUpdateProcessorFactory javadocs: fixed <pattern> to <str name="pattern"> (and likewise for replacement clarify what the the literals & types were for the new config in the description reworded example description to be bulleted list instead of run on sentence(s) moved/reworded "common case" explanation to the end, after all the major functionality is explained, to clarify it's syntactic sugar and put next to it's example. replaced the one off comment about FirstValueUpdateProcessor with a more general comment about cloning into multivalue fields and various FieldValueSubsetUpdateProcessorFactory options CloneFieldUpdateProcessorFactoryTest standardized indenting updated testCloneFieldExample to include the additions made to the javadoc example updated testCloneCombinations to include a first value clone test added corrisponding clone-first to solrconfig-update-processor-chains.xml updated testCloneField with more equivilence tests – this helps ensure we've got good coverage for the case where multi-selector + pattern + replacement that results in single dest field getting values from multiple source fields. added corrisponding new clone-single-regex, clone-multi-regex, clone-array-regex, clone-selector-regex and clone-simple-regex-syntax to solrconfig-update-processor-chains.xml as needed CloneFieldUpdateProcessorFactory code: fixed getInstance to use getSourceSelector for safe error handling (eliminates the unused warning you asked about) refactored init method into 2 helper methods specific to the two syntax styles (readability) tweaked config error messages & added additional error handling of bad config combos improved error reporting when pattern is invalid and can't be compile My one remaining concern with this patch is the use of Matcher.replaceFirst ... i feel like we should probably be using Matcher.replaceAll since it would provide a feature superset of replaceFirst (ie: using replaceAll can still support all the current patch behavior via start/end bound constraints + capture groups, but replaceFirst can't support everything possible with replaceAll). Gus: What do you think?
          Hide
          Gus Heck added a comment -

          All your updates look fine. I notice that most of my tests disappeared, but probably there was a lot of overlap with your original tests (which also verified boost maintenance whereas mine did not). I'm trusting that you are happy with the coverage there.

          I also added 2 more validations of config parameters to give nicer messages if someone uses 3 or 4 of the possible parameters (src/dest/pattern/replacement). This would have been caught by the "extra params" check at the end but is less clear. I also moved the extra params check to the main method as a safety net for all cases.

          I had been on the fence WRT replace vs replaceAll, (simplicity vs flexibility) but in looking around I notice FieldNameMutatingUpdateProcessor uses replace all. Consistency seems like a good idea.

          Show
          Gus Heck added a comment - All your updates look fine. I notice that most of my tests disappeared, but probably there was a lot of overlap with your original tests (which also verified boost maintenance whereas mine did not). I'm trusting that you are happy with the coverage there. I also added 2 more validations of config parameters to give nicer messages if someone uses 3 or 4 of the possible parameters (src/dest/pattern/replacement). This would have been caught by the "extra params" check at the end but is less clear. I also moved the extra params check to the main method as a safety net for all cases. I had been on the fence WRT replace vs replaceAll, (simplicity vs flexibility) but in looking around I notice FieldNameMutatingUpdateProcessor uses replace all. Consistency seems like a good idea.
          Hide
          Gus Heck added a comment -

          Correction... a combination of the formatting change and my memory inflating the number of tests in the test class fooled me into thinking my tests disappeared when I looked too quickly. They are in fact there, so I think we're good to go per our conversation tonight.

          Show
          Gus Heck added a comment - Correction... a combination of the formatting change and my memory inflating the number of tests in the test class fooled me into thinking my tests disappeared when I looked too quickly. They are in fact there, so I think we're good to go per our conversation tonight.
          Hide
          Hoss Man added a comment -

          Patch Updates...

          • Fixed javadocs to refer to replaceAll (code change was already previous patch)
          • added testCloneFieldRegexReplaceAll to prove replaceAll usecases would work.
            • test initially failed due to the Matcher.replaceAll being conditional on a succesful "Mather.match()" call.
            • Replaced "Mather.match()" with a "Matcher.find()" call to ensure replaceAll would work as intended
          • Updated the javadocs example config (and corrisponding test config) refering to "ending in _price" to actually use string ending bounds.
          • Updated the test configs to include the start/end boundary rules for (^feat(.*)s$) so it matches the javadocs

          ...i plan to commit & start backporting as soon as my full test + precommit finishes

          Show
          Hoss Man added a comment - Patch Updates... Fixed javadocs to refer to replaceAll (code change was already previous patch) added testCloneFieldRegexReplaceAll to prove replaceAll usecases would work. test initially failed due to the Matcher.replaceAll being conditional on a succesful "Mather.match()" call. Replaced "Mather.match()" with a "Matcher.find()" call to ensure replaceAll would work as intended Updated the javadocs example config (and corrisponding test config) refering to "ending in _price " to actually use string ending bounds. Updated the test configs to include the start/end boundary rules for ( ^feat(.*)s$ ) so it matches the javadocs ...i plan to commit & start backporting as soon as my full test + precommit finishes
          Hide
          ASF subversion and git services added a comment -

          Commit 1712195 from hossman@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1712195 ]

          SOLR-8113: CloneFieldUpdateProcessorFactory now supports choosing a dest field name based on a regex pattern and replacement init options.

          Show
          ASF subversion and git services added a comment - Commit 1712195 from hossman@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1712195 ] SOLR-8113 : CloneFieldUpdateProcessorFactory now supports choosing a dest field name based on a regex pattern and replacement init options.
          Hide
          ASF subversion and git services added a comment -

          Commit 1712204 from hossman@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1712204 ]

          SOLR-8113: CloneFieldUpdateProcessorFactory now supports choosing a dest field name based on a regex pattern and replacement init options. (merge r1712195)

          Show
          ASF subversion and git services added a comment - Commit 1712204 from hossman@apache.org in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1712204 ] SOLR-8113 : CloneFieldUpdateProcessorFactory now supports choosing a dest field name based on a regex pattern and replacement init options. (merge r1712195)
          Hide
          Hoss Man added a comment -

          Thanks Gus.

          Show
          Hoss Man added a comment - Thanks Gus.

            People

            • Assignee:
              Hoss Man
              Reporter:
              Gus Heck
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development