  1. Lucene - Core
  2. LUCENE-6874

WhitespaceTokenizer should tokenize on NBSP

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4, 6.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      WhitespaceTokenizer uses Character.isWhitespace to decide what is whitespace. Here's a pertinent excerpt from its Javadoc:

      It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F')

      Perhaps Character.isWhitespace should have been called isLineBreakableWhitespace?
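The behavior described in the excerpt can be checked with plain JDK calls. The sketch below is illustrative only: the class name `NbspWhitespaceDemo` and the helper `isNonBreakingGap` are made up for this example and are not part of Lucene. It compares `Character.isWhitespace` with `Character.isSpaceChar` for the three non-breaking spaces named in the Javadoc.

```java
class NbspWhitespaceDemo {
    // The three non-breaking spaces named in the Character.isWhitespace Javadoc:
    // NO-BREAK SPACE, FIGURE SPACE, NARROW NO-BREAK SPACE.
    static final int[] NON_BREAKING = {0x00A0, 0x2007, 0x202F};

    // Hypothetical helper: true when a code point is a Unicode space character
    // that Character.isWhitespace nevertheless rejects.
    static boolean isNonBreakingGap(int codePoint) {
        return Character.isSpaceChar(codePoint) && !Character.isWhitespace(codePoint);
    }

    public static void main(String[] args) {
        for (int cp : NON_BREAKING) {
            System.out.printf("U+%04X isWhitespace=%b isSpaceChar=%b%n",
                    cp, Character.isWhitespace(cp), Character.isSpaceChar(cp));
        }
    }
}
```

For each of the three code points, `isWhitespace` should report false while `isSpaceChar` reports true, which is why a tokenizer built on `isWhitespace` leaves text joined across an NBSP.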

      I think WhitespaceTokenizer should tokenize on non-breaking spaces too. I am aware it's easy to work around, but why leave this trap in by default?
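One possible shape of such a workaround, sketched in plain Java rather than against the Lucene API: treat a code point as a separator if either `Character.isWhitespace` or `Character.isSpaceChar` accepts it. The class `NbspAwareSplit` and its methods are hypothetical names for this example, and this is not necessarily the approach taken in the attached patches. In Lucene the same predicate could presumably be negated inside a `CharTokenizer` subclass's `isTokenChar(int)`.

```java
import java.util.ArrayList;
import java.util.List;

class NbspAwareSplit {
    // Hypothetical predicate: Java whitespace plus every Unicode space
    // character, which brings the non-breaking spaces back in.
    static boolean isTokenSeparator(int codePoint) {
        return Character.isWhitespace(codePoint) || Character.isSpaceChar(codePoint);
    }

    // Splits text into maximal runs of non-separator code points.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isTokenSeparator(cp)) {
                // A separator ends the current token, if one is in progress.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

With this predicate, a string such as "foo", NBSP, "bar" splits into two tokens instead of staying as one.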

        Attachments

        1. icu-datasucker.patch
          2 kB
          Uwe Schindler
        2. LUCENE_6874_jflex.patch
          49 kB
          David Smiley
        3. LUCENE-6874.patch
          8 kB
          Uwe Schindler
        4. LUCENE-6874-chartokenizer.patch
          27 kB
          Uwe Schindler
        5. LUCENE-6874-chartokenizer.patch
          27 kB
          Uwe Schindler
        6. LUCENE-6874-chartokenizer.patch
          23 kB
          Uwe Schindler
        7. LUCENE-6874-jflex.patch
          56 kB
          Steve Rowe
        8. unicode-ws-tokenizer.patch
          13 kB
          Uwe Schindler
        9. unicode-ws-tokenizer.patch
          13 kB
          Uwe Schindler
        10. unicode-ws-tokenizer.patch
          10 kB
          Uwe Schindler

              People

              • Assignee:
                thetaphi Uwe Schindler
                Reporter:
                dsmiley David Smiley
              • Votes:
                0
                Watchers:
                7
