[LUCY-196] UAX #29 tokenizer - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.3.0 (incubating)
Component/s: Analysis
Labels: None

Description

It would be nice to have a default tokenizer in core. A tokenizer based on the Unicode word boundaries defined in UAX #29 Unicode Text Segmentation seems like a good choice. That's also how Lucene's StandardTokenizer works.

See the following thread on lucy-dev:
http://mail-archives.apache.org/mod_mbox/incubator-lucy-dev/201111.mbox/browser

Also see:
http://unicode.org/reports/tr29/#Word_Boundaries
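
To illustrate the kind of word-boundary behaviour described above, here is a rough sketch in Python. It is not Lucy's implementation and not the full UAX #29 algorithm: it only approximates a handful of the word-boundary rules (WB5 to WB12) with a regular expression, ignores Extend/Format characters, Katakana, underscores, and the rest of the property tables, and keeps only segments that contain letters or digits. The name `tokenize` and the small punctuation sets are assumptions made for this example.

```python
import re

# Simplified stand-ins for UAX #29 word-break classes (assumptions, not the
# real Unicode property tables):
#   letters        ~ ALetter     digits ~ Numeric
#   . ' ’          ~ MidNumLet / Single_Quote (may join letters or digits)
#   , ;            ~ MidNum (may join digits only)
_LETTER = r"[^\W\d_]"   # any Unicode letter-like word character
_ALNUM = r"[^\W_]"      # letter or digit (underscore excluded for simplicity)

_WORD = re.compile(
    rf"{_ALNUM}+"                                # WB5, WB8, WB9, WB10
    rf"(?:(?:(?<={_LETTER})[.'’](?={_LETTER})"   # WB6/WB7: letter . letter
    rf"|(?<=\d)[.,;'’](?=\d))"                   # WB11/WB12: digit , digit
    rf"{_ALNUM}+)*"
)


def tokenize(text):
    """Return the alphanumeric segments of `text` as a list of strings.

    Whitespace and punctuation-only segments are dropped, which is an
    assumption about how a search tokenizer would use the boundaries.
    """
    return _WORD.findall(text)


if __name__ == "__main__":
    print(tokenize("It's 3.14, e.g. can't a,b 1,000"))
```

Under these simplified rules an apostrophe or full stop between two letters does not split a word, a comma between two digits does not split a number, and a comma between two letters does.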

People

Assignee: Nikolas Wellnhofer

Reporter: Nikolas Wellnhofer

Votes: 0

Watchers: 0

Dates

Created: 04/Dec/11 15:08

Updated: 13/Dec/11 00:42

Resolved: 13/Dec/11 00:42