  1. Solr
  2. SOLR-4864

RegexReplaceProcessorFactory should support pattern capture group substitution in replacement string

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.3
    • Fix Version/s: 4.8, 6.0
    • Component/s: update
    • Labels:
      None

      Description

      It is unfortunate that the replacement string for RegexReplaceProcessorFactory is a pure, "quoted" (escaped) literal and does not support pattern capture group substitution. This processor should be enhanced to support full, standard pattern capture group substitution.

      The test case I used:

        <updateRequestProcessorChain name="regex-mark-special-words">
          <processor class="solr.RegexReplaceProcessorFactory">
            <str name="fieldRegex">.*</str>
            <str name="pattern">([^a-zA-Z]|^)(cat|dog|fox)([^a-zA-Z]|$)</str>
            <str name="replacement">$1&lt;&lt;$2&gt;&gt;$3</str>
          </processor>
          <processor class="solr.LogUpdateProcessorFactory" />
          <processor class="solr.RunUpdateProcessorFactory" />
        </updateRequestProcessorChain>
      

      Indexing with this command against the standard Solr example with the above addition to solrconfig:

        curl "http://localhost:8983/solr/update?commit=true&update.chain=regex-mark-special-words" \
        -H 'Content-type:application/json' -d '
        [{"id": "doc-1",
          "title": "Hello World",
          "content": "The cat and the dog jumped over the fox.",
          "other_ss": ["cat","cat bird", "lazy dog", "red fox den"]}]'
      

      Alas, the resulting document consists of:

        "id":"doc-1",
        "title":["Hello World"],
        "content":["The$1<<$2>>$3and the$1<<$2>>$3jumped over the$1<<$2>>$3"],
        "other_ss":["$1<<$2>>$3",
          "$1<<$2>>$3bird",
          "lazy$1<<$2>>$3",
          "red$1<<$2>>$3den"],
      

      The Javadoc for RegexReplaceProcessorFactory uses the exact same terminology of "replacement string", as does Java's Matcher.replaceAll, but clearly the semantics are distinct, with replaceAll supporting pattern capture group substitution for its "replacement string", while RegexReplaceProcessorFactory interprets "replacement string" as being a literal. At a minimum, the RegexReplaceProcessorFactory Javadoc should explicitly state that the string is a literal that does not support pattern capture group substitution.

      The relevant code in RegexReplaceProcessorFactory#init:

      replacement = Matcher.quoteReplacement(replacementParam.toString());
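
      The difference can be reproduced outside Solr with plain java.util.regex. The following is only a minimal sketch, not Solr code: the class name ReplacementQuotingDemo and its two helper methods are made up for illustration, while the pattern, replacement, and input are copied from the test case above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone demo; not part of Solr.
class ReplacementQuotingDemo {
    // Pattern and replacement copied from the test case in this issue.
    static final Pattern PATTERN =
        Pattern.compile("([^a-zA-Z]|^)(cat|dog|fox)([^a-zA-Z]|$)");
    static final String REPLACEMENT = "$1<<$2>>$3";

    // Mirrors the quoting done in init: "$" is escaped, so it is emitted literally.
    static String quoted(String input) {
        return PATTERN.matcher(input)
                      .replaceAll(Matcher.quoteReplacement(REPLACEMENT));
    }

    // Without quoting, "$n" in the replacement refers to capture group n.
    static String unquoted(String input) {
        return PATTERN.matcher(input).replaceAll(REPLACEMENT);
    }

    public static void main(String[] args) {
        String content = "The cat and the dog jumped over the fox.";
        // Prints: The$1<<$2>>$3and the$1<<$2>>$3jumped over the$1<<$2>>$3
        System.out.println(quoted(content));
        // Prints: The <<cat>> and the <<dog>> jumped over the <<fox>>.
        System.out.println(unquoted(content));
    }
}
```

      The quoted variant reproduces the "content" value shown in the resulting document above; the unquoted variant is the output the test case was hoping for.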
      

      Possible options for the enhancement:

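      1. Simply skip the quoteReplacement and fully support pattern capture group substitution with no additional changes. This does have a minor backcompat issue: an existing replacement string that contains a literal "$" or "\" would no longer be taken literally.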

      2. Add an alternative to "replacement", say "nonQuotedReplacement" that is not quoted as "replacement" is.

      3. Add an option, say "quotedReplacement" that defaults to "true" for backcompat, but can be set to "false" to support full replaceAll pattern capture group substitution.
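
      To make the trade-off between options 1 and 3 concrete, here is a rough sketch of how a boolean flag could select between the two behaviors. This is an illustration under assumptions, not the committed patch: the class QuotedReplacementOption and its methods are hypothetical, and "quotedReplacement" is only the option name floated in option 3 above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of option 3; none of these names exist in Solr.
class QuotedReplacementOption {
    // quotedReplacement == true keeps today's literal behavior (the backcompat default);
    // false passes the string through so "$n" group references are honored.
    static String resolveReplacement(String replacementParam, boolean quotedReplacement) {
        return quotedReplacement
            ? Matcher.quoteReplacement(replacementParam)
            : replacementParam;
    }

    static String apply(String regex, String input,
                        String replacementParam, boolean quotedReplacement) {
        String replacement = resolveReplacement(replacementParam, quotedReplacement);
        return Pattern.compile(regex).matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // A replacement with a literal "$" works when quoted.
        System.out.println(apply("dollars", "5 dollars", "$ (USD)", true));
        // The same replacement, unquoted, is rejected by java.util.regex because
        // "$" must be followed by a group reference. This is the backcompat
        // risk of option 1.
        try {
            apply("dollars", "5 dollars", "$ (USD)", false);
        } catch (IllegalArgumentException e) {
            System.out.println("unquoted replacement rejected: " + e.getMessage());
        }
        // With the flag off, capture groups can be used in the replacement.
        System.out.println(apply("(cat|dog)", "lazy dog", "<<$1>>", false));
    }
}
```

      Defaulting the flag to true would leave existing configurations untouched, while setting it to false would opt in to the full replaceAll semantics.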

        Attachments

        1. SOLR-4864.patch
          8 kB
          Steve Rowe
        2. SOLR-4864.patch
          6 kB
          Sunil Srinivasan
        3. SOLR-4864.patch
          6 kB
          Sunil Srinivasan

            People

            • Assignee:
              steve_rowe Steve Rowe
              Reporter:
              jkrupan Jack Krupansky
            • Votes:
              0
              Watchers:
              4
