Lucene - Core / LUCENE-2244

Improve StandardTokenizer's understanding of non ASCII punctuation and quotes

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      In the vein of LUCENE-1126 and LUCENE-1390, StandardTokenizerImpl.jflex should do a better job at understanding non-ASCII punctuation characters.

      For example, it recognizes only the ASCII single-quote character "'" as an apostrophe: a token's type is set to APOSTROPHE only if that exact character was used.
      In the attached patch, I added all the characters that ASCIIFoldingFilter would change into "'".

      I'm not sure that this is the right approach, so I didn't write a complete patch covering the other characters hardcoded in the jflex rules, such as "." and "-", which also have variants in ASCIIFoldingFilter that could be used.

      Maybe a better approach would be to make it possible to insert an ASCIIFoldingFilter-like reader, acting as a character filter, in front of StandardTokenizer?
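
      To make the character-filter idea concrete, here is a minimal sketch in plain Java. It is not taken from the attached patch and does not use Lucene's own classes: the class name ApostropheFoldingReader, the helper foldAll, and the three characters it folds are assumptions chosen for illustration only. It wraps any java.io.Reader and rewrites a few apostrophe look-alikes to "'" before the text would reach a tokenizer.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical sketch: folds a few apostrophe look-alikes to ASCII "'" ahead of tokenization. */
public class ApostropheFoldingReader extends FilterReader {

    public ApostropheFoldingReader(Reader in) {
        super(in);
    }

    // Illustrative subset only (an assumption); the patch's actual character list is not reproduced here.
    static char fold(char c) {
        switch (c) {
            case '\u2018': // LEFT SINGLE QUOTATION MARK
            case '\u2019': // RIGHT SINGLE QUOTATION MARK
            case '\uFF07': // FULLWIDTH APOSTROPHE
                return '\'';
            default:
                return c;
        }
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return c < 0 ? c : fold((char) c); // pass end-of-stream (-1) through unchanged
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        for (int i = off; i < off + n; i++) { // loop body never runs when n is -1
            buf[i] = fold(buf[i]);
        }
        return n;
    }

    /** Convenience helper for the example: runs a whole string through the reader. */
    static String foldAll(String text) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new ApostropheFoldingReader(new StringReader(text))) {
            char[] buf = new char[64];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(foldAll("l\u2019avion"));
    }
}
```

      Because each replacement is one character for one character, character offsets in the folded stream match those of the original text, so token offsets would need no correction. A folding that changes length (for example a ligature expanded to two letters) would need offset bookkeeping that this sketch does not attempt.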

              People

              • Assignee:
                Unassigned
              • Reporter:
                Andi Vajda (vajda)