[LUCENE-2244] Improve StandardTokenizer's understanding of non ASCII punctuation and quotes - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 3.0
Fix Version/s: 3.1, 4.0-ALPHA
Component/s: modules/analysis
Labels: None

Lucene Fields: New, Patch Available

Description

In the vein of LUCENE-1126 and LUCENE-1390, StandardTokenizerImpl.jflex should do a better job at understanding non-ASCII punctuation characters.

For example, it currently recognizes only the ASCII single-quote character "'" as an apostrophe: a token's type is set to APOSTROPHE only if that exact character was used.
In the attached patch, I added all the characters that ASCIIFoldingFilter would fold into "'".
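
To make the idea concrete, here is a minimal Java sketch of treating several apostrophe variants as one class. It is not the attached JFlex patch: the class name `ApostropheChars`, its methods, and the character list are assumptions chosen for illustration, and the list is only a small subset of what a folding filter might map to "'".

```java
// Hypothetical helper, not part of Lucene: classifies apostrophe-like characters.
final class ApostropheChars {

    private ApostropheChars() {}

    // Returns true for the ASCII apostrophe and a few Unicode look-alikes.
    // The set below is an illustrative subset, not an authoritative list.
    static boolean isApostropheLike(char c) {
        switch (c) {
            case '\'':      // ASCII APOSTROPHE
            case '\u2018':  // LEFT SINGLE QUOTATION MARK
            case '\u2019':  // RIGHT SINGLE QUOTATION MARK
            case '\u201B':  // SINGLE HIGH-REVERSED-9 QUOTATION MARK
            case '\uFF07':  // FULLWIDTH APOSTROPHE
                return true;
            default:
                return false;
        }
    }

    // Returns true if any character in the text is apostrophe-like.
    static boolean containsApostropheLike(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (isApostropheLike(text.charAt(i))) {
                return true;
            }
        }
        return false;
    }
}
```

With this sketch, `ApostropheChars.containsApostropheLike("O\u2019Neil")` is true even though the string contains no ASCII "'", which is the case the current grammar misses.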

I'm not sure that this is the right approach, so I didn't write a complete patch covering the other characters hardcoded in the jflex rules, such as "." and "-", which also have variants in ASCIIFoldingFilter that could be used.

Maybe a better approach would be to make it possible to have an ASCIIFoldingFilter-like reader as a character filter that could be inserted in front of StandardTokenizer?
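
The character-filter alternative could look roughly like the sketch below. It uses only `java.io` so that it stands alone; a real integration would presumably go through Lucene's own character-filter mechanism instead. The class name `ApostropheFoldingReader`, the helper `foldAll`, and the character list are hypothetical. Each character maps to exactly one character here, so character offsets are unchanged.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical reader that folds apostrophe variants to ASCII "'"
// before the text reaches a tokenizer.
final class ApostropheFoldingReader extends FilterReader {

    ApostropheFoldingReader(Reader in) {
        super(in);
    }

    // Illustrative subset of apostrophe-like characters; not an authoritative list.
    static char fold(char c) {
        switch (c) {
            case '\u2018':  // LEFT SINGLE QUOTATION MARK
            case '\u2019':  // RIGHT SINGLE QUOTATION MARK
            case '\u201B':  // SINGLE HIGH-REVERSED-9 QUOTATION MARK
            case '\uFF07':  // FULLWIDTH APOSTROPHE
                return '\'';
            default:
                return c;
        }
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return c < 0 ? c : fold((char) c);
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // When n is -1 (end of stream) the loop body never runs.
        for (int i = off; i < off + n; i++) {
            buf[i] = fold(buf[i]);
        }
        return n;
    }

    // Convenience helper for trying the reader on a string.
    static String foldAll(String text) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new ApostropheFoldingReader(new StringReader(text))) {
            int c;
            while ((c = r.read()) >= 0) {
                out.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

Wrapping the input this way would leave the tokenizer grammar untouched: it would only ever see the ASCII "'", at the cost of losing the original character in the token text.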

Attachments

StandardTokenizerImpl.jflex.diff (0.8 kB, 30/Jan/10 23:36, Andi Vajda)

Issue Links

depends upon

- LUCENE-2074 Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer (Closed)

is related to

- LUCENE-2167 Implement StandardTokenizer with the UAX#29 Standard (Closed)
- SOLR-2013 ASCIIFoldingFilter => MappingCharFilterFactory as a mapping file (Closed)

People

Assignee: Unassigned

Reporter: Andi Vajda

Votes: 0

Watchers: 0

Dates

Created: 30/Jan/10 23:34

Updated: 28/Aug/22 12:19

Resolved: 26/Jan/11 12:09