Lucene - Core

LUCENE-2414: Add ICU-based tokenizer for Unicode text segmentation

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/other
    • Labels: None
    • Lucene Fields: New, Patch Available

    Description

      I pulled out the last part of LUCENE-1488 (the tokenizer itself) and cleaned it up some.

      The idea is simple:

      • The first step is to divide the text at writing-system (script) boundaries (a small illustration follows this list).
      • You supply an ICUTokenizerConfig (or just use the default), which lets you tailor segmentation on a per-writing-system basis.
      • The tailoring for each script can be any BreakIterator, so it can be rule-based, dictionary-based, or your own implementation.
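
      As a rough illustration of that first step (conceptual only, not the tokenizer's internals), the sketch below uses ICU4J's UScript to report where the writing system changes in a mixed-script string; the tokenizer then hands each such chunk to the BreakIterator configured for that script. The class name and sample text are made up for the example.

      import com.ibm.icu.lang.UScript;

      // Conceptual sketch: report offsets where the writing system changes.
      // The real tokenizer works incrementally and also attaches COMMON/INHERITED
      // characters (spaces, digits, punctuation) to a neighboring script run.
      public class ScriptBoundaryDemo {
        public static void main(String[] args) {
          String text = "latin מילים עבריות ไทย";
          int prev = UScript.COMMON;
          for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            int script = UScript.getScript(cp);
            if (script != UScript.COMMON && script != UScript.INHERITED && script != prev) {
              System.out.println("offset " + i + ": switch to " + UScript.getName(script));
              prev = script;
            }
            i += Character.charCount(cp);
          }
        }
      }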

      The default implementation (if you do not customize) is just UAX#29, but with tailorings for scripts that have no clear word division (a usage sketch follows this list):

      • Thai (uses dictionary-based word breaking)
      • Khmer, Myanmar, Lao (use custom rules for syllabification)
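
      A minimal usage sketch of that default setup, assuming the 3.1-era constructor that takes a Reader (later Lucene versions set the reader separately); no extra configuration is needed for the tailorings above to apply:

      import java.io.StringReader;

      import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
      import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

      // Minimal sketch: tokenize mixed Thai/English text with the default config.
      public class DefaultICUTokenizerDemo {
        public static void main(String[] args) throws Exception {
          ICUTokenizer tokenizer = new ICUTokenizer(new StringReader("ข้อความภาษาไทย and some English"));
          CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
          tokenizer.reset();
          while (tokenizer.incrementToken()) {
            System.out.println(term.toString());
          }
          tokenizer.end();
          tokenizer.close();
        }
      }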

      Additionally, more as an example, I have a tailoring for Hebrew (sketched below) that treats punctuation specially. (People have asked
      before for ways to make StandardAnalyzer treat dashes differently, etc.)
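
      A minimal sketch of what such a tailoring could look like, assuming the ICUTokenizerConfig shape from the patch (a getBreakIterator(int script) / getType(int script, int ruleStatus) pair) and a no-arg DefaultICUTokenizerConfig; exact names and signatures may differ in the committed version. The Hebrew rules themselves are elided here, so a plain locale-aware word BreakIterator stands in for them, and the class name is made up.

      import com.ibm.icu.lang.UScript;
      import com.ibm.icu.text.BreakIterator;
      import com.ibm.icu.util.ULocale;

      import org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig;
      import org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig;

      // Sketch: swap in a different BreakIterator for Hebrew only, and delegate the
      // rest to the default (UAX#29 plus Thai/Khmer/Myanmar/Lao tailorings) config.
      public class HebrewTailoredConfig extends ICUTokenizerConfig {
        private final ICUTokenizerConfig defaults = new DefaultICUTokenizerConfig();
        // Stand-in for a real rule-based tailoring of Hebrew punctuation (geresh/gershayim).
        private final BreakIterator hebrewWords = BreakIterator.getWordInstance(new ULocale("he"));

        @Override
        public BreakIterator getBreakIterator(int script) {
          if (script == UScript.HEBREW) {
            // BreakIterators carry iteration state, so hand out a copy.
            return (BreakIterator) hebrewWords.clone();
          }
          return defaults.getBreakIterator(script);
        }

        @Override
        public String getType(int script, int ruleStatus) {
          return defaults.getType(script, ruleStatus);
        }
      }

      Wiring it up would then just be a matter of passing the config to the tokenizer, e.g. new ICUTokenizer(reader, new HebrewTailoredConfig()), assuming the Reader-plus-config constructor from the patch.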

      Attachments

        1. LUCENE-2414.patch (98 kB, Robert Muir)
        2. LUCENE-2414.patch (95 kB, Robert Muir)
        3. LUCENE-2414.patch (95 kB, Uwe Schindler)
        4. LUCENE-2414.patch (94 kB, Robert Muir)
        5. LUCENE-2414.patch (96 kB, Robert Muir)


            People

              Assignee: Robert Muir (rcmuir)
              Reporter: Robert Muir (rcmuir)
              Votes: 0
              Watchers: 0

              Dates

                Created:
                Updated:
                Resolved: