Lucene - Core
LUCENE-2906

Filter to process output of ICUTokenizer and create overlapping bigrams for CJK

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The ICUTokenizer produces unigrams for CJK. We would like to use the ICUTokenizer but have overlapping bigrams created for CJK, as in the CJK Analyzer. This filter would take the output of the ICUTokenizer, read the ScriptAttribute and, for selected scripts (Han, Kana), produce overlapping bigrams.
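The overlapping-bigram behavior described here can be illustrated with a minimal sketch (plain Java over a string of characters, not the actual Lucene filter, which operates on a TokenStream):

```java
import java.util.ArrayList;
import java.util.List;

public class BigramSketch {
    // Produce overlapping bigrams from a run of CJK characters,
    // e.g. "一二三" -> ["一二", "二三"], mirroring CJKAnalyzer's output.
    static List<String> overlappingBigrams(String run) {
        List<String> bigrams = new ArrayList<>();
        for (int i = 0; i + 2 <= run.length(); i++) {
            bigrams.add(run.substring(i, i + 2));
        }
        return bigrams;
    }

    public static void main(String[] args) {
        System.out.println(overlappingBigrams("一二三四")); // [一二, 二三, 三四]
    }
}
```

Note this char-index sketch would mishandle supplementary characters (which occupy two chars); handling them correctly is one of the motivations discussed below.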

      1. LUCENE-2906.patch
        27 kB
        Robert Muir
      2. LUCENE-2906.patch
        49 kB
        Robert Muir
      3. LUCENE-2906.patch
        54 kB
        Robert Muir
      4. LUCENE-2906.patch
        55 kB
        Robert Muir
      There are no Sub-Tasks for this issue.

        Activity

        Robert Muir added a comment -

        I created LUCENE-3669 for the broader interned-type issue

        Robert Muir added a comment -

        Hi Uwe:

        Many filters in Lucene currently do things like this, and have forever (including StandardFilter).
        In my opinion it's OK, as it's documented that this filter works with StandardTokenizer and ICUTokenizer, which use
        the interned types.

        So I would prefer if we discuss this on another issue.

        Uwe Schindler added a comment -

        ...alternatively, we could use a HashSet<String>. If the strings in it are interned, the lookup is fast, too. The hashCode of Strings is precalculated in the String class. For four if checks it may not be really different performance-wise, but that's just another idea. The ctor would simply check the flags and add the type strings to the Set<String>.
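The Set-based alternative Uwe describes might look like this (a sketch only; the constructor flags and the type-string literals are assumptions, not the real constants):

```java
import java.util.HashSet;
import java.util.Set;

public class TypeSetCheck {
    private final Set<String> bigramTypes = new HashSet<>();

    // The ctor checks the flags and adds the corresponding type strings.
    TypeSetCheck(boolean han, boolean hiragana, boolean katakana, boolean hangul) {
        if (han) bigramTypes.add("<IDEOGRAPHIC>");
        if (hiragana) bigramTypes.add("<HIRAGANA>");
        if (katakana) bigramTypes.add("<KATAKANA>");
        if (hangul) bigramTypes.add("<HANGUL>");
    }

    // HashSet.contains() compares cached String hash codes first, so
    // interned type strings are not required for a fast lookup.
    boolean shouldBigram(String type) {
        return bigramTypes.contains(type);
    }
}
```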

        Uwe Schindler added a comment - edited

        Hi Robert,

        I had no time to review before; there is one small thing that should maybe be fixed. Currently this filter relies on the fact that TypeAttribute strings are interned, as it compares by identity:

        String type = typeAtt.type();
        if (type == doHan || type == doHiragana || type == doKatakana || type == doHangul) {
        

        It is documented nowhere that Strings in TypeAttribute need to be interned. We should maybe replace that check by a simple equals(). It seems that you already wanted to do that, as you added a sentinel value Object NO = new Object(). With the above check this sentinel value is useless; a simple null would be enough. EDIT: The sentinel value is also useful for not enabling bigramming if a Tokenizer sets null as TypeAttribute. When using equals() this sentinel makes real sense. The check is not costly: String.equals() already does an identity check for early exit, and if the sentinel is used it will also quickly return false (if String.equals(sentinel) is used, it will return false on the instanceof check; if you call sentinel.equals(String) it will be even faster).

        So I would change this check to:

        String type = typeAtt.type();
        if (doHan.equals(type) || doHiragana.equals(type) || doKatakana.equals(type) || doHangul.equals(type)) {
        

        (This is the fastest check: if the doXXX is the sentinel, its default Object.equals() will return false. If it's a String, String.equals() will return true on identity very quickly, but if it's not interned it will be slower. So we lose nothing, but don't require uselessly interned strings.)
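The sentinel-plus-equals pattern discussed above could be sketched like this (field names and type strings are illustrative, not the filter's real members):

```java
public class SentinelCheck {
    // Sentinel: never equal to any String, so a disabled script can never
    // match, even if a Tokenizer emits null or an arbitrary type string.
    private static final Object NO = new Object();

    private final Object doHan;
    private final Object doHiragana;

    SentinelCheck(boolean han, boolean hiragana) {
        this.doHan = han ? "<IDEOGRAPHIC>" : NO;
        this.doHiragana = hiragana ? "<HIRAGANA>" : NO;
    }

    // equals() on the sentinel (plain Object.equals) is identity-only and
    // returns false fast; String.equals() short-circuits on identity, so
    // interned type strings lose nothing, and non-interned ones still work.
    boolean shouldBigram(String type) {
        return doHan.equals(type) || doHiragana.equals(type);
    }
}
```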

        Robert Muir added a comment -

        Committed. Maybe in the future we enable this for StandardFilter (for good CJK behavior by default), but for now it seems good enough to have separate filters that handle the corner cases and all of Unicode.

        Robert Muir added a comment -

        One new test and a tweak, so that this filter never calls input.incrementToken() after it has already returned false.
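The contract tweak described here (never pulling from the upstream again once it has signalled exhaustion) can be sketched generically in plain Java; this is not the Lucene TokenFilter API, just the same guard pattern over an Iterator:

```java
import java.util.Iterator;

// Wraps an upstream iterator and guarantees the upstream is never
// consulted again once it has reported exhaustion.
public class ExhaustionGuard<T> {
    private final Iterator<T> input;
    private boolean inputExhausted = false;

    public ExhaustionGuard(Iterator<T> input) {
        this.input = input;
    }

    public T nextOrNull() {
        if (inputExhausted) {
            return null; // do not touch the upstream again
        }
        if (!input.hasNext()) {
            inputExhausted = true;
            return null;
        }
        return input.next();
    }
}
```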

        Robert Muir added a comment -

        Patch removing all nocommits, with additional tests.

        I think it's ready to commit.

        Robert Muir added a comment -

        Synced up to trunk... it has a couple of minor nocommits (mostly just some needed tests) that I'll look at tomorrow morning.

        Robert Muir added a comment -

        Sorry to take so long, Tom... I'll round this out tonight.

        Robert Muir added a comment -

        As much as I would like this to work as the patch does (where it's automatic from StandardFilter), I think it's bad because it's something we then have to commit to and deal with for a while (e.g. backwards compat).

        So another idea is just to call it CJKFilter or something under the CJK package for now. We could still cut over CJKAnalyzer like the patch does, and then it finally works with supplementary characters too (which I think is really long needed).

        Tom Burton-West added a comment -

        Any chance this might get implemented for 3.4?

        Robert Muir added a comment -

        bulk move 3.2 -> 3.3

        Robert Muir added a comment -

        The prerequisite subtask is fixed, so we should be able to add this in 3.2 (supporting StandardTokenizer, UAX29URLEmailTokenizer, and ICUTokenizer) without having to change any of the tokenizers.

        I'll update the patch.

        Robert Muir added a comment -

        How will this differ from the SmartChineseAnalyzer?

        The SmartChineseAnalyzer is for Simplified Chinese only... this is about the
        language-independent technique similar to what CJKAnalyzer does today.

        I doubt it but can this be in 3.1?

        Well, I hate the way CJKAnalyzer treats things like supplementary characters (wrongly).
        This is definitely a bug, and it is fixed here. Part of me wants to fix this as quickly as possible.

        At the same time though, I would prefer 3.2... otherwise I would feel like I am rushing things.

        I don't think 3.2 needs to come a year after 3.1... in fact, since we have a stable branch, I think it's
        stupid to make bugfix releases like 3.1.1 when we could just push out a new minor version (3.2) with
        bugfixes instead. The whole branch is intended to be stable changes, so I think this is a better use
        of our time. But this is just my opinion; we can discuss it later on the list as one idea to promote
        more rapid releases.

        DM Smith added a comment -

        Two questions:
        How will this differ from the SmartChineseAnalyzer?
        I doubt it but can this be in 3.1?

        Robert Muir added a comment -

        Here's a patch going in a slightly different direction (though we can still add some special ICU-only stuff here).

        Instead, the patch synchronizes the token types of ICUTokenizer with StandardTokenizer, adds the necessary types to both, and then adds the bigramming logic to StandardFilter.

        This way, CJK works easily "out of the box" for all of Unicode (e.g. supplementaries) and plays well with other languages. I deprecated CJKTokenizer in the patch and pulled out its special full-width filter into a separate TokenFilter.

        Tom Burton-West added a comment -

        Sounds good to me.

        The option to limit to "joined" text also sounds very useful.

        Tom

        Robert Muir added a comment -

        I'll take it; I've done the unibigram approach already (maybe we can just have it as a separate filter option), so the bigram should be easy.

        My original design just lets you provide a BitSet of script codes. (This would be simple, I think, to parse from, say, a Solr factory.)

        I think it's also useful to have an option for whether the filter should only do this for "joined" text or not (based on offsets). For CJK I think it makes sense to enforce this, so that it won't bigram across sentence boundaries. But for, say, the Tibetan language, where you have a syllable separator, you would want to turn this off.

        Separately, if you want it to work "just like" CJKTokenizer, please be aware that by default, the Unicode standard tokenizes Katakana to words (only Hiragana and Han are tokenized to codepoints). So in this case you would have to use a custom ruleset if you wanted Katakana to be tokenized to codepoints instead of words, for later bigramming. I'm not sure you want to do this, though... (In truth, CJKTokenizer bigrams ANYTHING out of ASCII, including a lot of things it shouldn't.)

        For Hangul the same warning applies, but it's more debatable: you might want to do this if you don't have a decompounder... but in my opinion this is past tokenization, and it's the same problem you have with German, etc... the default tokenization is not "wrong".

        In either case, if you decide to do that, it would be a pretty simple ruleset!

        Let me know if this makes sense to you.
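The BitSet-of-script-codes design described in this comment might be sketched like this, using java.lang.Character.UnicodeScript ordinals in place of ICU's UScript constants (an assumption for illustration; the real filter would read ICU script codes from the ScriptAttribute):

```java
import java.util.BitSet;

public class ScriptFilterSketch {
    // One bit per script code; set bits mark scripts to bigram.
    private final BitSet bigramScripts = new BitSet();

    ScriptFilterSketch(Character.UnicodeScript... scripts) {
        for (Character.UnicodeScript s : scripts) {
            bigramScripts.set(s.ordinal());
        }
    }

    // Would the filter bigram a token starting with codepoint cp?
    boolean shouldBigram(int cp) {
        return bigramScripts.get(Character.UnicodeScript.of(cp).ordinal());
    }
}
```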


          People

          • Assignee:
            Robert Muir
            Reporter:
            Tom Burton-West
          • Votes:
            0
            Watchers:
            2
