  1. Lucene - Core
  2. LUCENE-6874

WhitespaceTokenizer should tokenize on NBSP

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4, 6.0
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

    Description

      WhitespaceTokenizer uses Character.isWhitespace to decide what is whitespace. Here's a pertinent excerpt from that method's Javadoc:

      It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F')

      Perhaps Character.isWhitespace should have been called isLineBreakableWhitespace?

      I think WhitespaceTokenizer should tokenize on NBSP. I am aware it's easy to work around, but why leave this trap in by default?
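      To make the trap concrete, here is a minimal sketch using only the JDK. It is not the patch attached to this issue and does not use any Lucene API: the class name NbspWhitespaceDemo and the helpers isSeparator and tokenize are hypothetical, and combining Character.isWhitespace with Character.isSpaceChar is just one assumed way to treat non-breaking spaces as separators.

```java
import java.util.ArrayList;
import java.util.List;

public class NbspWhitespaceDemo {

    // Hypothetical helper (not a Lucene API): a code point separates tokens if
    // Java calls it whitespace OR it is a Unicode space character. The second
    // test is what picks up the non-breaking spaces that isWhitespace excludes.
    public static boolean isSeparator(int codePoint) {
        return Character.isWhitespace(codePoint) || Character.isSpaceChar(codePoint);
    }

    // Hypothetical plain-JDK splitter that walks the text by code point and
    // cuts a token whenever it meets a separator.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isSeparator(cp)) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // NBSP is a space character but not "whitespace" per Character.isWhitespace.
        System.out.println(Character.isWhitespace('\u00A0')); // prints false
        System.out.println(Character.isSpaceChar('\u00A0'));  // prints true

        // With the combined predicate, NBSP splits tokens like a regular space.
        System.out.println(tokenize("foo\u00A0bar baz"));     // prints [foo, bar, baz]
    }
}
```

      A tokenizer that relies on Character.isWhitespace alone would keep "foo" and "bar" glued together as one token in that input, which is the surprise described above.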

      Attachments

        1. LUCENE-6874.patch (8 kB, Uwe Schindler)
        2. LUCENE-6874-jflex.patch (56 kB, Steven Rowe)
        3. LUCENE_6874_jflex.patch (49 kB, David Smiley)
        4. icu-datasucker.patch (2 kB, Uwe Schindler)
        5. unicode-ws-tokenizer.patch (10 kB, Uwe Schindler)
        6. unicode-ws-tokenizer.patch (13 kB, Uwe Schindler)
        7. unicode-ws-tokenizer.patch (13 kB, Uwe Schindler)
        8. LUCENE-6874-chartokenizer.patch (23 kB, Uwe Schindler)
        9. LUCENE-6874-chartokenizer.patch (27 kB, Uwe Schindler)
        10. LUCENE-6874-chartokenizer.patch (27 kB, Uwe Schindler)

            People

              Assignee: uschindler Uwe Schindler
              Reporter: dsmiley David Smiley
              Votes: 0
              Watchers: 7