  1. Solr
  2. SOLR-13242

RegexReplaceProcessorFactory not making accurate replacement

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 7.6, 7.7, 7.7.1
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      We are using RegexReplaceProcessorFactory and have tried each of the following configurations in solrconfig.xml:

       

      <!-- Attempt 1: optional whitespace, an optional \r and a \n, repeated two or more times -->
      <processor class="solr.RegexReplaceProcessorFactory">
        <str name="fieldName">content</str>
        <str name="pattern">(\s*\r?\n){2,}</str>
        <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
        <bool name="literalReplacement">true</bool>
      </processor>

      <!-- Attempt 2: same as attempt 1, with an explicit space added to the character class -->
      <processor class="solr.RegexReplaceProcessorFactory">
        <str name="fieldName">content</str>
        <str name="pattern">([ \s]*\r?\n){2,}</str>
        <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
        <bool name="literalReplacement">true</bool>
      </processor>

      <!-- Attempt 3: optional whitespace followed by \n only -->
      <processor class="solr.RegexReplaceProcessorFactory">
        <str name="fieldName">content</str>
        <str name="pattern">(\s*\n){2,}</str>
        <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
        <bool name="literalReplacement">true</bool>
      </processor>

      <!-- Attempt 4: \n followed by optional whitespace -->
      <processor class="solr.RegexReplaceProcessorFactory">
        <str name="fieldName">content</str>
        <str name="pattern">(\n\s*){2,}</str>
        <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
        <bool name="literalReplacement">true</bool>
      </processor>

       

      The patterns (\s*\r?\n){2,}, ([ \s]*\r?\n){2,}, (\s*\n){2,} and (\n\s*){2,} all work as expected on regex101.com: every run of consecutive \n characters, however long, is replaced by exactly two <br>.

      In Solr, however, there are cases (Examples 2 and 3 below) where the indexed content ends up with four <br>: two <br><br> pairs separated only by spaces. This should not happen, because each run is supposed to be replaced by two <br> regardless of how many \n it contains.
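
      One way to cross-check the patterns outside Solr is to run them through java.util.regex directly. The sketch below is not taken from Solr: the class and method names (BlankLineCollapse, collapse, countPairs) are made up for illustration, and using Matcher.quoteReplacement to stand in for literalReplacement=true is an assumption. It applies each of the four patterns to the escaped "Original content" string of Example 2, typed here with plain ASCII spaces and \n only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BlankLineCollapse {
    // The four patterns from the configurations, written as Java string literals.
    static final String[] PATTERNS = {
        "(\\s*\\r?\\n){2,}",
        "([ \\s]*\\r?\\n){2,}",
        "(\\s*\\n){2,}",
        "(\\n\\s*){2,}"
    };

    static final String REPLACEMENT = "<br><br>";

    // Replace every match of regex with the literal replacement text.
    // quoteReplacement is assumed to mirror literalReplacement=true.
    static String collapse(String regex, String input) {
        return Pattern.compile(regex)
                      .matcher(input)
                      .replaceAll(Matcher.quoteReplacement(REPLACEMENT));
    }

    // Count non-overlapping occurrences of the replacement text in s.
    static int countPairs(String s) {
        int count = 0;
        int i = s.indexOf(REPLACEMENT);
        while (i >= 0) {
            count++;
            i = s.indexOf(REPLACEMENT, i + REPLACEMENT.length());
        }
        return count;
    }

    public static void main(String[] args) {
        // Example 2 as transcribed in this report (ASCII whitespace only).
        String example2 = "exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa Chu Kang Avenue 4, Singapore";
        for (String p : PATTERNS) {
            String out = collapse(p, example2);
            System.out.println(p + " -> " + countPairs(out) + " pair(s): " + out);
        }
    }
}
```

      On this ASCII-only input each pattern should leave one <br><br> pair per gap. If the same check on the raw field value behaves differently, the indexed text probably differs from the string as transcribed in the examples.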

       

       

      Example 1: Content for which the pattern works correctly

      *Original content in EML file:*  

      Dear Sir, 

       

      I am terminating 

      Original content:    Dear Sir,  \n\n \n \n\n I am terminating

      Index content:     Dear Sir,  <br><br>I am terminating 

       

      Example 2: Content for which the pattern only partially works (between "Psalm 89:17" and "3 Choa Chu Kang Avenue 4" there are 4 <br> instead of 2)

      *Original content in EML file:*    

      exalted

      Psalm 89:17

       

      3 Choa Chu Kang Avenue 4    

      Original content: exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa Chu Kang Avenue 4, Singapore

      Index content: exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3 Choa Chu Kang Avenue 4, Singapore

       

      Example 3: Content for which the pattern only partially works (there are 4 <br> instead of 2)

      *Original content in EML file:*    

      http://www.concordpri.moe.edu.sg/

       

       

       

       

      On Tue, Dec 18, 2018 at 10:07 AM    

      Original content: http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018 at 10:07 AM 

      Index content: http://www.concordpri.moe.edu.sg/   <br><br>  <br><br>On Tue, Dec 18, 2018 at 10:07 AM
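
      The report does not identify a cause. One thing that could be checked is whether the spaces left between the two <br><br> pairs are ordinary ASCII spaces: without Unicode flags, Java's \s matches only space, \t, \n, \u000B, \f and \r, so a character such as U+00A0 (no-break space) would split a run into two matches. The sketch below is purely a diagnostic aid under that assumption; WhitespaceAudit and suspects are hypothetical names, and the sample string is invented, not taken from the original EML file.

```java
import java.util.ArrayList;
import java.util.List;

final class WhitespaceAudit {
    // Characters that Java's \s matches when no Unicode flag is set.
    static boolean matchedByDefaultBackslashS(int cp) {
        return cp == ' ' || cp == '\t' || cp == '\n'
            || cp == 0x0B || cp == '\f' || cp == '\r';
    }

    // List space-like code points in input that default \s would NOT match.
    static List<String> suspects(String input) {
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            int cp = input.codePointAt(i);
            boolean spaceLike = Character.isWhitespace(cp) || Character.isSpaceChar(cp);
            if (spaceLike && !matchedByDefaultBackslashS(cp)) {
                found.add(String.format("U+%04X at index %d", cp, i));
            }
            i += Character.charCount(cp);
        }
        return found;
    }

    public static void main(String[] args) {
        // Invented sample: two no-break spaces placed inside a run of newlines.
        String sample = "Psalm 89:17   \n\n \u00A0\u00A0\n\n  3 Choa Chu Kang Avenue 4";
        System.out.println(suspects(sample));
    }
}
```

      Running suspects over the stored value of the content field before the regex processor is applied would show whether any such characters are present. An empty list would rule this explanation out.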

            People

            • Assignee: Unassigned
            • Reporter: Edwin Yeo Zheng Lin (edwinyeozl)