[LUCENE-1343] A replacement for AsciiFoldingFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers. - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 3.1
Fix Version/s: 3.1, 4.0-ALPHA
Component/s: modules/analysis
Labels: None
Lucene Fields: New, Patch Available

Description

The ISOLatin1AccentFilter takes Unicode characters that have diacritical marks and replaces them with a version of that character with the diacritical mark removed. For example, é becomes e. However, another equally valid way of representing an accented character in Unicode is the unaccented character followed by a non-spacing modifier character (like this: é). The ISOLatin1AccentFilter does not handle the accents in decomposed Unicode characters at all. Additionally, some words contain what looks like an accented character but is actually considered a separate, unaccented character, such as Ł; to make searching easier, you may want to fold it onto its Latin-1 lookalike, L.
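To make the composed/decomposed distinction concrete, here is a minimal sketch using only the JDK's `java.text.Normalizer`. It is not taken from the attached patch; the class name `ComposedVsDecomposed` and its members are hypothetical and exist only for illustration.

```java
import java.text.Normalizer;

// Hypothetical illustration class; not part of Lucene or the attachments.
class ComposedVsDecomposed {
    // Precomposed form: the single code point U+00E9.
    static final String COMPOSED = "\u00e9";
    // Decomposed form: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    static final String DECOMPOSED = "e\u0301";

    // True if both strings are identical once normalized to NFC.
    static boolean sameAfterNfc(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        // The two strings render the same but differ as char sequences.
        System.out.println(COMPOSED.equals(DECOMPOSED));        // false
        System.out.println(COMPOSED.length());                  // 1
        System.out.println(DECOMPOSED.length());                // 2
        System.out.println(sameAfterNfc(COMPOSED, DECOMPOSED)); // true
    }
}
```

A filter that only maps precomposed code points to their base letters would pass the second string through with its combining mark intact, which is the gap described above.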

The UnicodeNormalizationFilter can filter out accents and diacritical marks whether they occur as composed or decomposed characters. It can also handle the cases described above, where characters that look like they have diacritics (but don't) are to be folded onto the letter they resemble (Ł -> L).
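The general idea can be sketched in plain Java, independent of Lucene. This is an assumption-laden illustration, not the attached UnicodeNormalizationFilter: it decomposes with `java.text.Normalizer` (NFD), strips non-spacing marks with a regex, and then applies a tiny hand-written lookalike map. The class `FoldingSketch`, the method `fold`, and the two-entry map are hypothetical; a real implementation would need a far larger mapping table.

```java
import java.text.Normalizer;

// Hypothetical illustration class; not the filter attached to this issue.
class FoldingSketch {
    static String fold(String input) {
        // Step 1: canonical decomposition, so U+00E9 becomes 'e' + U+0301.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Step 2: drop every non-spacing mark (Unicode general category Mn).
        String stripped = decomposed.replaceAll("\\p{Mn}+", "");
        // Step 3: U+0141 and U+0142 have no canonical decomposition, so they
        // survive steps 1 and 2 and need an explicit lookalike mapping.
        return stripped.replace('\u0141', 'L').replace('\u0142', 'l');
    }

    public static void main(String[] args) {
        System.out.println(fold("\u00e9"));             // composed input
        System.out.println(fold("e\u0301"));            // decomposed input
        System.out.println(fold("\u0141\u00f3d\u017a")); // mixed input
    }
}
```

Both the composed and the decomposed input fold to the same plain letter because decomposition happens before the marks are stripped. Step 3 is separate because normalization alone never changes characters such as Ł.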

Attachments

- LUCENE-1343.patch (20/Apr/10 14:35, 183 kB, Robert Muir)
- LUCENE-1343.patch (19/Apr/10 08:35, 176 kB, Robert Muir)
- normalizer.jar (22/Jul/08 19:41, 390 kB, Robert Haschart)
- UnicodeCharUtil.java (22/Jul/08 19:33, 25 kB, Robert Haschart)
- UnicodeNormalizationFilter.java (22/Jul/08 19:33, 4 kB, Robert Haschart)
- UnicodeNormalizationFilterFactory.java (22/Jul/08 19:33, 9 kB, Robert Haschart)
- utr30.nrm (20/Apr/10 14:36, 41 kB, Robert Muir)
- utr30.nrm (19/Apr/10 08:37, 41 kB, Robert Muir)

Issue Links

is related to: LUCENE-1390 add ASCIIFoldingFilter and deprecate ISOLatin1AccentFilter (Closed)

People

Assignee: Robert Muir

Reporter: Robert Haschart

Votes: 1

Watchers: 5

Dates

Created: 22/Jul/08 19:28

Updated: 28/Aug/22 11:51

Resolved: 06/May/10 12:28