[LUCENE-9177] ICUNormalizer2CharFilter worst case is very slow - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 9.0, 8.10
Component/s: None
Labels: None

Lucene Fields: New

Description

ICUNormalizer2CharFilter is fast most of the time, but we've had reports in Elasticsearch that some unrealistic data can slow processing down very significantly. For instance, an input that consists of characters to normalize with no normalization-inert character in between can take up to several seconds to process a few hundred kilobytes on my machine. While the input is not realistic, this worst case can slow down indexing considerably when dealing with uncleaned data.

I attached a small test that reproduces the slow processing using a stream that contains many repetitions of the character `℃` and no normalization-inert character. I am not surprised that the processing is slower than usual, but several seconds seems like a lot. Adding normalization-inert characters makes the processing much faster, so I wonder if we can improve the process to split the input more eagerly?
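For readers without the attachment, here is a minimal sketch of the kind of input described above. It is not the attached test: it uses the JDK's `java.text.Normalizer` with NFKC instead of `ICUNormalizer2CharFilter`, so it only illustrates the shape of the input and its normalized form, not the slowdown. The class name, the method names, and the choice of a space as separator are assumptions made for illustration; whether a given character is actually normalization-inert depends on the normalizer in use (ICU4J exposes `Normalizer2.isInert` for checking this).

```java
import java.text.Normalizer;

// Hypothetical helper class, not part of Lucene or of the attached patch.
public class WorstCaseInputSketch {
    // U+2103 DEGREE CELSIUS; NFKC maps it to U+00B0 DEGREE SIGN followed by 'C'.
    static final String DEGREE_CELSIUS = "\u2103";

    // Builds `count` copies of U+2103. When separatorEvery > 0, `separator`
    // is appended after every `separatorEvery` copies; when it is 0, the
    // result is one unbroken run with nothing between the characters.
    static String buildInput(int count, int separatorEvery, char separator) {
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= count; i++) {
            sb.append(DEGREE_CELSIUS);
            if (separatorEvery > 0 && i % separatorEvery == 0) {
                sb.append(separator);
            }
        }
        return sb.toString();
    }

    // Assumption: plain NFKC via the JDK stands in for the ICU normalizer here.
    static String normalize(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // One long run with no separator, as in the report.
        String unbroken = buildInput(100_000, 0, ' ');
        // The same characters with a separator after every 100 of them.
        String withSeparators = buildInput(100_000, 100, ' ');
        System.out.println(normalize(unbroken).length());
        System.out.println(normalize(withSeparators).length());
    }
}
```

To compare timings against the real filter, the same strings could be wrapped in a `StringReader` and read through `ICUNormalizer2CharFilter`, which is what the attached test exercises.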

Attachments

- lucene.patch (27/Jan/20 12:33, 3 kB, Jim Ferenczi)
- LUCENE-9177_LUCENE-8972.patch (28/Jun/21 20:27, 4 kB, Michael Gibney)
- LUCENE-9177-benchmark-test.patch (01/Jul/21 19:52, 4 kB, Michael Gibney)

Issue Links

links to GitHub Pull Request #199

People

Assignee: Unassigned

Reporter: Jim Ferenczi

Votes: 0

Watchers: 4

Dates

Created: 27/Jan/20 12:33

Updated: 28/Aug/22 15:57

Resolved: 14/Jul/21 01:58

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 20m