Description
We are using RegexReplaceProcessorFactory and have tried each of the following configurations in solrconfig.xml:
<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">(\s*\r?\n){2,}</str>
  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
  <bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">([ \s]*\r?\n){2,}</str>
  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
  <bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">(\s*\n){2,}</str>
  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
  <bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">(\n\s*){2,}</str>
  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
  <bool name="literalReplacement">true</bool>
</processor>
The regex patterns (\s*\r?\n){2,}, ([ \s]*\r?\n){2,}, (\s*\n){2,} and (\n\s*){2,} all work as expected on regex101.com: every run of consecutive \n is replaced by exactly two <br>.
However, in Solr there are cases (Examples 2 and 3 below) where four <br> appear in a row. This should not happen, because the pattern is meant to replace any run of \n with two <br>, regardless of how many \n the run contains.
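To check the pattern outside Solr, a small standalone harness may help. The sketch below assumes the processor applies the pattern through java.util.regex and quotes the replacement when literalReplacement is true. The class and method names are hypothetical. The input also assumes that the whitespace in Example 2 consists only of plain ASCII spaces and line feeds, exactly as transcribed in this report.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical harness for testing the pattern outside Solr; not part of Solr itself.
final class CollapseBlankLines {
    // Same pattern as the first configuration in this report.
    static final Pattern BLANK_RUN = Pattern.compile("(\\s*\\r?\\n){2,}");

    // Replaces every run of two or more line breaks with a literal "<br><br>".
    static String collapse(String content) {
        // quoteReplacement treats the replacement as literal text (no $ or \ handling).
        return BLANK_RUN.matcher(content).replaceAll(Matcher.quoteReplacement("<br><br>"));
    }

    public static void main(String[] args) {
        // Example 2 as transcribed, assuming only ASCII spaces and '\n'.
        String example2 =
            "exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore";
        System.out.println(collapse(example2));
    }
}
```

With this input each whitespace run collapses to a single <br><br>, so the transcribed string alone does not reproduce the four <br> seen in the index. That may mean the stored content differs from the transcription in some way that is not visible here.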
Example 1: A sentence where the above regex patterns work correctly
*Original content in EML file:*
Dear Sir,
I am terminating
Original content: Dear Sir, \n\n \n \n\n I am terminating
Index content: Dear Sir, <br><br>I am terminating
Example 2: A sentence where the above regex patterns only partially work (there are four <br> instead of two)
*Original content in EML file:*
exalted
Psalm 89:17
3 Choa Chu Kang Avenue 4
Original content: exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
Index content: exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
Example 3: A sentence where the above regex patterns only partially work (there are four <br> instead of two)
*Original content in EML file:*
http://www.concordpri.moe.edu.sg/
On Tue, Dec 18, 2018 at 10:07 AM
Original content: http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
Index content: http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
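One possible explanation for Examples 2 and 3, which this report does not confirm, is that the content holds a space-like character that Java's default \s does not match, such as a non-breaking space (U+00A0). By default \s in java.util.regex covers only space, tab, line feed, vertical tab, form feed and carriage return. The sketch below is a diagnostic under that assumption. The class name, method name and sample string are hypothetical, and the sample is constructed for illustration rather than taken from the actual EML file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic; the U+00A0 in the sample is an assumption, not observed data.
final class HiddenWhitespace {
    // Characters matched by Java's default \s: space, tab, LF, VT, FF, CR.
    private static final String DEFAULT_S = " \t\n\u000B\f\r";

    // Lists space-like code points that the default \s would NOT match.
    static List<String> find(String content) {
        List<String> found = new ArrayList<>();
        content.codePoints().forEach(cp -> {
            boolean spaceLike = Character.isWhitespace(cp) || Character.isSpaceChar(cp);
            if (spaceLike && DEFAULT_S.indexOf(cp) < 0) {
                found.add(String.format("U+%04X", cp));
            }
        });
        return found;
    }

    public static void main(String[] args) {
        // Constructed sample: a non-breaking space sits between two blank-line runs.
        String sample = "Psalm 89:17 \n\n\u00A0\n\n 3 Choa Chu Kang Avenue 4";
        System.out.println(find(sample));
        // The run is split in two, so two separate "<br><br>" replacements appear.
        System.out.println(sample.replaceAll("(\\s*\\r?\\n){2,}", "<br><br>"));
    }
}
```

Running find over the raw field value before indexing would show whether any such character is present. If one is, adding it to the character class explicitly (for example [\s\u00A0] in place of \s) might be worth trying.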