Details
Type: New Feature
Status: Resolved
Priority: Minor
Resolution: Fixed
Description
It would be nice to have a default tokenizer in core. A tokenizer based on the Unicode word boundaries defined in UAX #29 Unicode Text Segmentation seems like a good choice. That's also how Lucene's StandardTokenizer works.
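As a rough illustration of word-boundary tokenization, here is a minimal Java sketch. It is not Lucy's or Lucene's implementation. It assumes that `java.text.BreakIterator.getWordInstance` is an acceptable stand-in for UAX #29 word boundaries; the JDK's rules are similar in spirit but may differ from the specification in some cases. The class name `WordBoundaryTokenizer` and the rule of keeping only segments that contain a letter or digit are hypothetical choices made for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of Lucy or Lucene.
public class WordBoundaryTokenizer {

    // Splits text at the word boundaries reported by BreakIterator and
    // keeps only the segments that contain at least one letter or digit.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);

        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Assumption: whitespace and punctuation segments are not tokens.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world! It is 2011."));
    }
}
```

A boundary iterator returns every segment between breaks, including spaces and punctuation, so a tokenizer built on it still has to decide which segments count as tokens. The letter-or-digit filter above is one simple way to make that decision.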
See the following thread on lucy-dev:
http://mail-archives.apache.org/mod_mbox/incubator-lucy-dev/201111.mbox/browser