Details
- Type: New Feature
- Status: Closed
- Priority: Minor
- Resolution: Fixed
- 3.1, 4.0-ALPHA
- None
- Patch Available
Description
It would be really nice for StandardTokenizer to adhere directly to the standard (Unicode UAX#29 word segmentation) as much as we can with JFlex. Then its name would actually make sense.
Such a transition would involve renaming the old StandardTokenizer to EuropeanTokenizer, as its javadoc claims:
This should be a good tokenizer for most European-language documents
The new StandardTokenizer could then say:
This should be a good tokenizer for most languages.
All the English/Euro-centric handling, such as the acronym, company, and apostrophe rules, can stay with that EuropeanTokenizer, and it could be used by the European analyzers.
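To illustrate what language-neutral tokenization could look like, here is a minimal sketch. It assumes "the standard" means Unicode word-boundary rules, and it uses the JDK's `java.text.BreakIterator` as a rough stand-in. It is not the Lucene implementation and not the JFlex grammar proposed here; the class name `WordBreakSketch` and the `tokenize` helper are hypothetical, and the JDK's rules may differ in detail from UAX#29.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, for illustration only; not part of Lucene.
class WordBreakSketch {

    // Splits text at the JDK's word boundaries and keeps only the segments
    // that contain at least one letter or digit (drops spaces and punctuation).
    static List<String> tokenize(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world"));
    }
}
```

Note that nothing in this sketch special-cases acronyms, company names, or apostrophes; under the proposal, rules of that kind would live in the EuropeanTokenizer instead.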
Attachments
Issue Links
- incorporates
  - LUCENE-1545 Standard analyzer does not correctly tokenize combining character U+0364 COMBINING LATIN SMALL LETTER E (Closed)
  - LUCENE-1702 Thai token type() bug (Closed)
  - LUCENE-1556 some valid email address characters not correctly recognized (Closed)
- is related to
  - LUCENE-2763 Swap URL+Email recognizing StandardTokenizer and UAX29Tokenizer (Closed)
- relates to
  - LUCENE-2244 Improve StandardTokenizer's understanding of non ASCII punctuation and quotes (Closed)