[LUCENE-2070] document LengthFilter wrt Unicode 4.0 - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Trivial
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.1, 4.0-ALPHA
Component/s: modules/analysis
Labels: None
Lucene Fields: New, Patch Available

Description

LengthFilter calculates its min/max length from TermAttribute.termLength().
That value is a count of UTF-16 code units, not characters, so a supplementary character (one encoded as a surrogate pair) counts as two.
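
To illustrate the distinction, here is a minimal standalone sketch using only the JDK. The class and method names are hypothetical and are not part of Lucene; the example only shows how a code-unit count and a code-point count diverge for a supplementary character.

```java
// Hypothetical demo class, not Lucene code.
final class CodeUnitsVsCodePoints {

    // Length in UTF-16 code units (Java chars), the unit a char-buffer length is measured in.
    static int codeUnits(String term) {
        return term.length();
    }

    // Length in Unicode code points; this has to scan the text for surrogate pairs.
    static int codePoints(String term) {
        return term.codePointCount(0, term.length());
    }

    public static void main(String[] args) {
        String bmp = "abc";
        // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP and is stored as a surrogate pair.
        String clef = new String(Character.toChars(0x1D11E));

        System.out.println(codeUnits(bmp) + " " + codePoints(bmp));   // prints "3 3"
        System.out.println(codeUnits(clef) + " " + codePoints(clef)); // prints "2 1"
    }
}
```

For text made up only of BMP characters the two counts agree, so the difference matters only when supplementary characters appear in a term.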

In my opinion this should not be changed, merely documented.
Changing it would have an adverse performance impact, because we would have to actually compute Character.codePointCount() on the term text.

If you feel strongly otherwise, fixing it to count code points would be a trivial patch, but I'd rather not hurt performance.
I admit I don't fully understand all the use cases for this filter.
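
As a rough sketch of the trade-off, the two checks might look like the following. This is not the attached patch and not LengthFilter's actual code: the class, method names, and parameters are hypothetical, and the term is assumed to be available as a char[] buffer plus a length in code units.

```java
// Hypothetical helper, not Lucene code: compares the two ways of measuring term length.
final class LengthCheckSketch {

    // Code-unit check: constant time, since the length is already known.
    static boolean acceptByCodeUnits(int length, int min, int max) {
        return length >= min && length <= max;
    }

    // Code-point check: Character.codePointCount scans the buffer,
    // so the cost grows with the length of the term.
    static boolean acceptByCodePoints(char[] buffer, int length, int min, int max) {
        int codePoints = Character.codePointCount(buffer, 0, length);
        return codePoints >= min && codePoints <= max;
    }
}
```

With min = 1 and max = 2, a term made of two supplementary characters (four code units) would be rejected by the first check and accepted by the second.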

Attachments

LUCENE-2070.patch (0.5 kB), Robert Muir, 16/Nov/09 17:42

Issue Links

is part of: LUCENE-1689 supplementary character handling (Resolved)

People

Assignee: Robert Muir

Reporter: Robert Muir

Votes: 0

Watchers: 1

Dates

Created: 16/Nov/09 17:41

Updated: 28/Aug/22 12:14

Resolved: 24/Sep/10 00:48