[LUCENE-2644] LowerCaseTokenizer Does Not Behave As One Might Expect (or Desire)--Given Its Name - ASF JIRA

Details

Type: Wish
Status: Resolved
Priority: Major
Resolution: Won't Fix
Affects Version/s: 3.0.2
Fix Version/s: None
Component/s: modules/analysis
Labels: None
Lucene Fields: New, Patch Available

Description

While I understand some of the reasons for its design, the original LowerCaseTokenizer should have been named LowerCaseLetterTokenizer.

I feel that LowerCaseTokenizer makes too many assumptions about what to tokenize, and I have therefore patched it. The default behavior remains as it always has been, to avoid breaking any existing code that uses it.

I have changed LowerCaseTokenizer to extend CharTokenizer (rather than LetterTokenizer). LetterTokenizer's functionality was merged into the default behavior of LowerCaseTokenizer.
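To make the hierarchy change concrete, here is a minimal sketch, independent of Lucene, of the contract that CharTokenizer-style classes follow: a subclass decides which characters belong to a token and how each character is normalized. The class names `SketchCharTokenizer` and `SketchLowerCaseTokenizer` are hypothetical and are not taken from the patch; the real classes read from a Reader rather than a String, so treat this only as an illustration of the idea.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a CharTokenizer-style base class (not Lucene code).
abstract class SketchCharTokenizer {
    // True if c belongs inside a token; any other character ends the current token.
    protected abstract boolean isTokenChar(char c);

    // Hook for per-character normalization; identity by default.
    protected char normalize(char c) {
        return c;
    }

    // Splits text into maximal runs of token characters, normalizing each one.
    public List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(normalize(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}

// The letter-only rule lives directly in the lowercasing subclass,
// mirroring the merge described above (no intermediate letter tokenizer).
class SketchLowerCaseTokenizer extends SketchCharTokenizer {
    @Override
    protected boolean isTokenChar(char c) {
        return Character.isLetter(c);
    }

    @Override
    protected char normalize(char c) {
        return Character.toLowerCase(c);
    }
}
```

With this sketch, `new SketchLowerCaseTokenizer().tokenize("Wi-Fi 802.11n")` returns `[wi, fi, n]`: digits and punctuation act as separators and are dropped, which is the letters-only behavior the name LowerCaseLetterTokenizer would have described.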

Getter and setter methods have been added to the LowerCaseTokenizer class, allowing you to enable or disable tokenizing by whitespace, numbers, and special (non-alphanumeric) characters.
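The following is a rough sketch of how such switches could work, not the code from the attached patch. The class name `ConfigurableLowerCaseSketch` and the flag and accessor names are assumptions made for illustration. It also assumes that "tokenizing by" a character class means splitting on it, so turning a flag off keeps those characters inside tokens, and that all flags default to on so the letters-only behavior is preserved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; flag and method names are assumptions, not the patch's API.
class ConfigurableLowerCaseSketch {
    // All three default to true, which reproduces the letters-only behavior.
    private boolean splitOnWhitespace = true;
    private boolean splitOnDigits = true;
    private boolean splitOnSpecial = true;

    public boolean isSplitOnWhitespace() { return splitOnWhitespace; }
    public void setSplitOnWhitespace(boolean value) { splitOnWhitespace = value; }

    public boolean isSplitOnDigits() { return splitOnDigits; }
    public void setSplitOnDigits(boolean value) { splitOnDigits = value; }

    public boolean isSplitOnSpecial() { return splitOnSpecial; }
    public void setSplitOnSpecial(boolean value) { splitOnSpecial = value; }

    // A character stays inside a token if it is a letter, or if splitting
    // on its character class has been switched off.
    private boolean isTokenChar(char c) {
        if (Character.isLetter(c)) return true;
        if (Character.isWhitespace(c)) return !splitOnWhitespace;
        if (Character.isDigit(c)) return !splitOnDigits;
        return !splitOnSpecial; // everything else counts as "special"
    }

    // Splits text into lowercased tokens according to the current flags.
    public List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

Under these assumptions, the input `"Wi-Fi 802.11n"` gives `[wi, fi, n]` with the defaults, `[wi, fi, 802, 11n]` after `setSplitOnDigits(false)`, and `[wi-fi, 802.11n]` once `setSplitOnSpecial(false)` is applied as well.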

Attachments

LowerCaseTokenizer.patch (6 kB), attached 14/Sep/10 21:17 by Scott Gonyea

People

Assignee: Unassigned

Reporter: Scott Gonyea

Votes: 0

Watchers: 1

Dates

Created: 14/Sep/10 21:15

Updated: 28/Aug/22 12:32

Resolved: 18/Sep/12 17:47