Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Fix Version/s: 3.1, 4.0-ALPHA
- Component/s: None
- Lucene Fields: New
Description
As of Lucene 3.1, StandardTokenizer implements the UAX#29 word boundary rules, providing language-neutral tokenization. Lucene contains several language-specific tokenizers that should be replaced by the UAX#29-based StandardTokenizer: deprecate them in 3.1 and remove them in 4.0. The language-specific analyzers, by contrast, should remain, because they contain language-specific post-tokenization filters; those analyzers should switch to StandardTokenizer in 3.1, as in the sketch below.
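As a rough sketch (assuming the Lucene 3.x analysis API, not the actual committed patch), an analyzer like ArabicAnalyzer could swap its letter tokenizer for StandardTokenizer while keeping its language-specific filter chain. SimpleArabicAnalyzer is a hypothetical name, and the real analyzer's stopword and stemming filters are omitted for brevity:

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ar.ArabicNormalizationFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Hypothetical, simplified analyzer: UAX#29 tokenization via StandardTokenizer
// replaces the deprecated ArabicLetterTokenizer, while the language-specific
// post-tokenization filters are kept.
public class SimpleArabicAnalyzer extends Analyzer {
  private final Version matchVersion;

  public SimpleArabicAnalyzer(Version matchVersion) {
    this.matchVersion = matchVersion;
  }

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new StandardTokenizer(matchVersion, reader);
    result = new LowerCaseFilter(matchVersion, result);
    result = new ArabicNormalizationFilter(result);
    return result;
  }
}
```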
Some usages of the language-specific tokenizers will need additional work beyond simply swapping in StandardTokenizer in the language-specific analyzer.
For example, PersianAnalyzer currently uses ArabicLetterTokenizer and depends on that tokenizer breaking tokens on ZWNJ (zero-width non-joiner, U+200C), but under the UAX#29 word boundary rules ZWNJ is not a word boundary. Robert Muir has suggested a char filter that converts ZWNJ to a space ahead of StandardTokenizer in the converted PersianAnalyzer; a sketch of this idea follows.
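A minimal sketch of that suggestion, assuming the Lucene 3.1 API: MappingCharFilter is used here purely to illustrate rewriting ZWNJ to a space before tokenization (the committed fix may differ), and the sample string is Persian text containing a ZWNJ:

```java
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ZwnjCharFilterDemo {
  public static void main(String[] args) throws Exception {
    // Map ZWNJ (U+200C) to a space so the UAX#29 tokenizer still splits there.
    NormalizeCharMap map = new NormalizeCharMap();
    map.add("\u200C", " ");

    // "mi-khorad": the prefix and stem are joined by a ZWNJ, not a space.
    Reader input = new StringReader("\u0645\u06CC\u200C\u062E\u0648\u0631\u062F");
    Reader filtered = new MappingCharFilter(map, CharReader.get(input));

    TokenStream ts = new StandardTokenizer(Version.LUCENE_31, filtered);
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.toString()); // two tokens, split at the ZWNJ
    }
    ts.end();
    ts.close();
  }
}
```

With the char filter in place, StandardTokenizer sees a space at the ZWNJ position and emits two tokens; without it, the UAX#29 rules would keep the joined form as a single token.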