Solr
  1. Solr
  2. SOLR-4864

RegexReplaceProcessorFactory should support pattern capture group substitution in replacement string

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.3
    • Fix Version/s: 4.8, 6.0
    • Component/s: update
    • Labels:
      None

      Description

      It is unfortunate the the replacement string for RegexReplaceProcessorFactory is a pure, "quoted" (escaped) literal and does not support pattern capture group substitution. This processor should be enhanced to support full, standard pattern capture group substitution.

      The test case I used:

        <updateRequestProcessorChain name="regex-mark-special-words">
          <processor class="solr.RegexReplaceProcessorFactory">
            <str name="fieldRegex">.*</str>
            <str name="pattern">([^a-zA-Z]|^)(cat|dog|fox)([^a-zA-Z]|$)</str>
            <str name="replacement">$1&lt;&lt;$2&gt;&gt;$3</str>
          </processor>
          <processor class="solr.LogUpdateProcessorFactory" />
          <processor class="solr.RunUpdateProcessorFactory" />
        </updateRequestProcessorChain>
      

      Indexing with this command against the standard Solr example with the above addition to solrconfig:

        curl "http://localhost:8983/solr/update?commit=true&update.chain=regex-mark-special-words" \
        -H 'Content-type:application/json' -d '
        [{"id": "doc-1",
          "title": "Hello World",
          "content": "The cat and the dog jumped over the fox.",
          "other_ss": ["cat","cat bird", "lazy dog", "red fox den"]}]'
      

      Alas, the resulting document consists of:

        "id":"doc-1",
        "title":["Hello World"],
        "content":["The$1<<$2>>$3and the$1<<$2>>$3jumped over the$1<<$2>>$3"],
        "other_ss":["$1<<$2>>$3",
          "$1<<$2>>$3bird",
          "lazy$1<<$2>>$3",
          "red$1<<$2>>$3den"],
      

      The Javadoc for RegexReplaceProcessorFactory uses the exact same terminology of "replacement string", as does Java's Matcher.replaceAll, but clearly the semantics are distinct, with replaceAll supporting pattern capture group substitution for its "replacement string", while RegexReplaceProcessorFactory interprets "replacement string" as being a literal. At a minimum, the RegexReplaceProcessorFactory Javadoc should explicitly state that the string is a literal that does not support pattern capture group substitution.

      The relevant code in RegexReplaceProcessorFactory#init:

      replacement = Matcher.quoteReplacement(replacementParam.toString());
      

      Possible options for the enhancement:

      1. Simply skip the quoteReplacement and fully support pattern capture group substitution with no additional changes. Does have a minor backcompat issue.

      2. Add an alternative to "replacement", say "nonQuotedReplacement" that is not quoted as "replacement" is.

      3. Add an option, say "quotedReplacement" that defaults to "true" for backcompat, but can be set to "false" to support full replaceAll pattern capture group substitution.

      1. SOLR-4864.patch
        8 kB
        Steve Rowe
      2. SOLR-4864.patch
        6 kB
        Sunil Srinivasan
      3. SOLR-4864.patch
        6 kB
        Sunil Srinivasan

        Activity

        Hide
        Hoss Man added a comment -

        It was designed for the simple case, but i agree - an option to let you use a more robust capture related replacement string makes sense.

        My vote would probably be for option #3 (although perhaps "literalReplacement" would be a less obscure name for non-java people not directly familiar with the Matcher function names). But option #2 would also work fine (as long as there was good error checking to ensure they didn't specify both types of replacement)

        Show
        Hoss Man added a comment - It was designed for the simple case, but i agree - an option to let you use a more robust capture related replacement string makes sense. My vote would probably be for option #3 (although perhaps "literalReplacement" would be a less obscure name for non-java people not directly familiar with the Matcher function names). But option #2 would also work fine (as long as there was good error checking to ensure they didn't specify both types of replacement)
        Hide
        Sunil Srinivasan added a comment -

        Here is a patch for the functionality that Hoss/Jack wanted. Please review. I've added the additional tests as part of the testRegexReplace. Please let me know if they need to be separated out.

        Show
        Sunil Srinivasan added a comment - Here is a patch for the functionality that Hoss/Jack wanted. Please review. I've added the additional tests as part of the testRegexReplace. Please let me know if they need to be separated out.
        Hide
        Steve Rowe added a comment -

        Hi Sunil,

        Patch looks good, a few comments:

        1. Unless you're making changes to the import section of .java files, you shouldn't rearrange them or add/subtract whitespace. You may need to tell your IDE to stop doing this for you.
        2. In RegexReplaceProcessorFactory.init(), instead of args.removeArg() to get the boolean literalReplacement param value, you should use args.removeBooleanArg(), because users might specify this arg as <str name="literalReplacement">false</str> instead of as a <bool>, resulting in a String-valued object returned by args.removeArg(), and your cast to Boolean would throw an exception. (NamedList.removeBooleanArg() handles type checking and parsing for you.)
        3. To test the ability to accept a String-valued boolean param, I recommend adding an <updateRequestProcessorChain> with a <str>-valued literalReplacement to solrconfig-update-processor-chains.xml, and adding a test for it to FieldMutatingUpdateProcessorTest.testRegexReplace().
        4. I recommend adding new <updateRequestProcessorChain>-s in solrconfig-update-processor-chains.xml, and corresponding tests in FieldMutatingUpdateProcessorTest.testRegexReplace(), that verify that when literalReplacement is true, either explicitly or by default, a text/pattern/replacement combination that would trigger group substitution if literalReplacement were false would not trigger group substitution.
        Show
        Steve Rowe added a comment - Hi Sunil, Patch looks good, a few comments: Unless you're making changes to the import section of .java files, you shouldn't rearrange them or add/subtract whitespace. You may need to tell your IDE to stop doing this for you. In RegexReplaceProcessorFactory.init() , instead of args.removeArg() to get the boolean literalReplacement param value, you should use args.removeBooleanArg() , because users might specify this arg as <str name="literalReplacement">false</str> instead of as a <bool> , resulting in a String -valued object returned by args.removeArg() , and your cast to Boolean would throw an exception. ( NamedList.removeBooleanArg() handles type checking and parsing for you.) To test the ability to accept a String -valued boolean param, I recommend adding an <updateRequestProcessorChain> with a <str> -valued literalReplacement to solrconfig-update-processor-chains.xml , and adding a test for it to FieldMutatingUpdateProcessorTest.testRegexReplace() . I recommend adding new <updateRequestProcessorChain> -s in solrconfig-update-processor-chains.xml , and corresponding tests in FieldMutatingUpdateProcessorTest.testRegexReplace() , that verify that when literalReplacement is true , either explicitly or by default, a text/pattern/replacement combination that would trigger group substitution if literalReplacement were false would not trigger group substitution.
        Hide
        Steve Rowe added a comment -

        Sunil, one other thing I forgot to mention: you should add an example of using the param to the class javadocs on RegexReplaceProcessorFactory, and also explain how the new param works.

        Show
        Steve Rowe added a comment - Sunil, one other thing I forgot to mention: you should add an example of using the param to the class javadocs on RegexReplaceProcessorFactory , and also explain how the new param works.
        Hide
        Sunil Srinivasan added a comment -

        Thanks Steve for the review

        I've made the changes and attached the patch

        Show
        Sunil Srinivasan added a comment - Thanks Steve for the review I've made the changes and attached the patch
        Hide
        Steve Rowe added a comment -

        Thanks Sunil.

        The attached patch moves your javadoc addition to its own paragraph and rewords it a little to be more explicit about what literalReplacement means.

        I also removed an unnecessary cast from Object to Boolean for the return value from NamedList.removeBooleanArg(); added a test case for backreference replacement syntax when literalReplacement is not specified and defaults to true; and added a CHANGES.txt entry.

        I think it's ready to go. I'll commit later today.

        Show
        Steve Rowe added a comment - Thanks Sunil. The attached patch moves your javadoc addition to its own paragraph and rewords it a little to be more explicit about what literalReplacement means. I also removed an unnecessary cast from Object to Boolean for the return value from NamedList.removeBooleanArg() ; added a test case for backreference replacement syntax when literalReplacement is not specified and defaults to true; and added a CHANGES.txt entry. I think it's ready to go. I'll commit later today.
        Hide
        ASF subversion and git services added a comment -

        Commit 1586093 from sarowe@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1586093 ]

        SOLR-4864: RegexReplaceProcessorFactory should support pattern capture group substitution in replacement string.

        Show
        ASF subversion and git services added a comment - Commit 1586093 from sarowe@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1586093 ] SOLR-4864 : RegexReplaceProcessorFactory should support pattern capture group substitution in replacement string.
        Hide
        ASF subversion and git services added a comment -

        Commit 1586094 from sarowe@apache.org in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1586094 ]

        SOLR-4864: RegexReplaceProcessorFactory should support pattern capture group substitution in replacement string. (merged trunk r1586093)

        Show
        ASF subversion and git services added a comment - Commit 1586094 from sarowe@apache.org in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1586094 ] SOLR-4864 : RegexReplaceProcessorFactory should support pattern capture group substitution in replacement string. (merged trunk r1586093)
        Hide
        Steve Rowe added a comment -

        Committed to trunk and branch_4x.

        Thanks Sunil and Jack!

        Show
        Steve Rowe added a comment - Committed to trunk and branch_4x. Thanks Sunil and Jack!
        Hide
        Uwe Schindler added a comment -

        Close issue after release of 4.8.0

        Show
        Uwe Schindler added a comment - Close issue after release of 4.8.0

          People

          • Assignee:
            Steve Rowe
            Reporter:
            Jack Krupansky
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development