  1. Lucene - Core
  2. LUCENE-2070

document LengthFilter wrt Unicode 4.0

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New, Patch Available

    Description

      LengthFilter calculates its min/max length from TermAttribute.termLength().
      This counts UTF-16 code units, not characters (Unicode code points), so a supplementary character outside the BMP counts as two.

      In my opinion this should not be changed, merely documented.
      Changing it would hurt performance, because we would have to call Character.codePointCount() on the text of every token.

      If you feel strongly otherwise, fixing it to count code points would be a trivial patch, but I'd rather not hurt performance.
      I admit I don't fully understand all the use cases for this filter.
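
      To illustrate the difference described above, here is a minimal standalone sketch. It does not use Lucene: the class name LengthCheckSketch and its helper methods are hypothetical, and the char[] buffer merely stands in for a term buffer. Only Character.codePointCount() is a real JDK call. The sketch compares a length check in UTF-16 code units with one in code points.

```java
public class LengthCheckSketch {
    // Length in UTF-16 code units: the buffer length itself, no scan needed.
    static int codeUnitLength(char[] buffer, int length) {
        return length;
    }

    // Length in Unicode code points: requires a pass over the buffer
    // to detect surrogate pairs, which is the extra cost discussed above.
    static int codePointLength(char[] buffer, int length) {
        return Character.codePointCount(buffer, 0, length);
    }

    // Hypothetical min/max check, inclusive on both ends.
    static boolean accept(int len, int min, int max) {
        return len >= min && len <= max;
    }

    public static void main(String[] args) {
        // 'a' followed by U+1D11E (MUSICAL SYMBOL G CLEF), which lies
        // outside the BMP and is encoded as a surrogate pair in UTF-16.
        char[] term = "a\uD834\uDD1E".toCharArray();

        int units = codeUnitLength(term, term.length);
        int points = codePointLength(term, term.length);

        System.out.println(units + " code units, " + points + " code points");
        System.out.println("max=2 by code units:  " + accept(units, 1, 2));
        System.out.println("max=2 by code points: " + accept(points, 1, 2));
    }
}
```

      With a maximum of 2, this term is rejected when measured in code units and accepted when measured in code points, which is the behavior the documentation needs to call out.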

      Attachments

        1. LUCENE-2070.patch
          0.5 kB
          Robert Muir

            People

              Assignee: rcmuir Robert Muir
              Reporter: rcmuir Robert Muir