  1. Lucene - Core
  2. LUCENE-9177

ICUNormalizer2CharFilter worst case is very slow

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 9.0, 8.10
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

    Description

      ICUNormalizer2CharFilter is fast most of the time, but we've had some reports in Elasticsearch that unrealistic data can slow the process down very significantly. For instance, an input that consists only of characters to normalize, with no normalization-inert character in between, can take up to several seconds to process a few hundred kilobytes on my machine. While such input is not realistic, this worst case can slow down indexing considerably when dealing with uncleaned data.

      I attached a small test that reproduces the slow processing using a stream that contains many repetitions of the character `℃` and no normalization-inert character. I am not surprised that the processing is slower than usual, but several seconds seems like a lot. Adding normalization-inert characters makes the processing much faster, so I wonder if we can improve the process to split the input more eagerly?
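
      To make the two input shapes concrete, here is a minimal sketch that is not the attached test. It uses only the JDK's `java.text.Normalizer` rather than ICUNormalizer2CharFilter, so it shows how the inputs are built and what `℃` normalizes to under NFKC, but it does not reproduce the slowdown. The class and method names (`WorstCaseInput`, `repeated`, `withInertBreaks`, `nfkc`) are hypothetical, and the choice of a space as the normalization-inert character is an assumption.

```java
import java.text.Normalizer;

public class WorstCaseInput {
    // U+2103 DEGREE CELSIUS, the character used in the description.
    static final String DEGREE_CELSIUS = "\u2103";

    // Worst-case shape: only characters that need normalization,
    // with nothing inert in between.
    static String repeated(int count) {
        StringBuilder sb = new StringBuilder(count);
        for (int i = 0; i < count; i++) {
            sb.append(DEGREE_CELSIUS);
        }
        return sb.toString();
    }

    // Friendlier shape: the same characters, with a space after every
    // `every` repetitions. The space is ASSUMED to be normalization-inert.
    static String withInertBreaks(int count, int every) {
        StringBuilder sb = new StringBuilder(count + count / every);
        for (int i = 0; i < count; i++) {
            sb.append(DEGREE_CELSIUS);
            if ((i + 1) % every == 0) {
                sb.append(' ');
            }
        }
        return sb.toString();
    }

    // JDK NFKC normalization, used here only as a stand-in to show
    // that the character is rewritten by normalization.
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String worst = repeated(100_000);
        String broken = withInertBreaks(100_000, 64);
        System.out.println("NFKC of U+2103: " + nfkc(DEGREE_CELSIUS));
        System.out.println("worst-case length: " + worst.length());
        System.out.println("with breaks length: " + broken.length());
    }
}
```

      Feeding each string through the char filter via a `StringReader` and timing the reads would be one way to compare the two shapes; the break interval of 64 is arbitrary.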

       

      Attachments

        1. LUCENE-9177-benchmark-test.patch
          4 kB
          Michael Gibney
        2. LUCENE-9177_LUCENE-8972.patch
          4 kB
          Michael Gibney
        3. lucene.patch
          3 kB
          Jim Ferenczi

            People

              Assignee: Unassigned
              Reporter: Jim Ferenczi
                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 20m