Details
- Bug
- Status: Closed
- Major
- Resolution: Fixed
- 3.0
- None
- New, Patch Available
Description
In the vein of LUCENE-1126 and LUCENE-1390, StandardTokenizerImpl.jflex should do a better job at understanding non-ASCII punctuation characters.
For example, its understanding of the single-quote character is currently limited to the ASCII apostrophe "'" (U+0027): it sets a token's type to APOSTROPHE only if that exact character was used.
In the attached patch, I added all the characters that ASCIIFoldingFilter folds into "'".
I'm not sure this is the right approach, so I didn't write a complete patch covering the other characters hardcoded in the jflex rules, such as "." and "-", which also have variants in ASCIIFoldingFilter that could be handled the same way.
Maybe a better approach would be to provide an ASCIIFoldingFilter-like reader as a character filter that could be inserted in front of StandardTokenizer?
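To make the character-filter idea concrete, here is a minimal sketch in plain Java with no Lucene dependency. The class name ApostropheFoldingReader and the helper foldAll are hypothetical, and the four characters folded are only an illustrative subset of apostrophe-like code points; the exact set ASCIIFoldingFilter handles should be taken from its source. The wrapper rewrites characters one-to-one before any tokenizer sees them, so character offsets into the original text stay unchanged.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical sketch: a Reader wrapper that folds a few apostrophe-like
// characters to the ASCII apostrophe before the text reaches a tokenizer.
public class ApostropheFoldingReader extends FilterReader {

    public ApostropheFoldingReader(Reader in) {
        super(in);
    }

    // Illustrative subset only (an assumption, not the full folding table).
    static char fold(char c) {
        switch (c) {
            case '\u2018': // left single quotation mark
            case '\u2019': // right single quotation mark
            case '\u201B': // single high-reversed-9 quotation mark
            case '\uFF07': // fullwidth apostrophe
                return '\'';
            default:
                return c;
        }
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return c == -1 ? -1 : fold((char) c);
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // When n is -1 (end of stream) the loop body never runs.
        for (int i = off; i < off + n; i++) {
            buf[i] = fold(buf[i]);
        }
        return n;
    }

    // Convenience helper for trying the reader on a whole string.
    static String foldAll(String s) {
        try (Reader r = new ApostropheFoldingReader(new StringReader(s))) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[256];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With such a reader in front of the tokenizer, the existing jflex rule for "'" would see the typographic variants as ordinary apostrophes without any grammar change. A mapping that changes the text length (for example folding a ligature into two letters) would additionally need offset correction, which this sketch does not attempt.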
Attachments
Issue Links
- depends upon
  LUCENE-2074 Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer (Closed)
- is related to
  LUCENE-2167 Implement StandardTokenizer with the UAX#29 Standard (Closed)
  SOLR-2013 ASCIIFoldingFilter => MappingCharFilterFactory as a mapping file (Closed)