  1. Lucene - Core
  2. LUCENE-5096

WhitespaceTokenizer supports Java whitespace, should also support Unicode whitespace


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 4.3.1
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels: None
    • Environment: all
    • Lucene Fields: New

    Description

      The whitespace tokenizer splits only on Java whitespace, as defined by Character.isWhitespace(char): http://docs.oracle.com/javase/6/docs/api/java/lang/Character.html#isWhitespace(char)

      A useful improvement would be to also support Unicode whitespace, that is, the characters carrying the White_Space property in the Unicode property list: http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
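
The gap between the two definitions can be sketched in plain Java. The class and method names below (UnicodeWhitespace, isUnicodeWhitespace) are hypothetical and not part of Lucene. The code point list is an assumption: it is hard-coded from the White_Space entries of a recent PropList.txt and should be checked against the Unicode version actually targeted. This is only an illustration of the difference, not the change that resolved the issue.

```java
// Sketch only: a hand-written predicate for the Unicode White_Space property.
// The code points are assumed from a recent PropList.txt; verify before relying on them.
final class UnicodeWhitespace {

    private UnicodeWhitespace() {}

    /** Returns true if the code point is in the assumed White_Space list. */
    static boolean isUnicodeWhitespace(int cp) {
        return (cp >= 0x0009 && cp <= 0x000D)   // TAB, LF, VT, FF, CR
            || cp == 0x0020                      // SPACE
            || cp == 0x0085                      // NEXT LINE (NEL)
            || cp == 0x00A0                      // NO-BREAK SPACE
            || cp == 0x1680                      // OGHAM SPACE MARK
            || (cp >= 0x2000 && cp <= 0x200A)    // EN QUAD .. HAIR SPACE
            || cp == 0x2028                      // LINE SEPARATOR
            || cp == 0x2029                      // PARAGRAPH SEPARATOR
            || cp == 0x202F                      // NARROW NO-BREAK SPACE
            || cp == 0x205F                      // MEDIUM MATHEMATICAL SPACE
            || cp == 0x3000;                     // IDEOGRAPHIC SPACE
    }

    public static void main(String[] args) {
        // Code points where the two definitions are expected to disagree,
        // plus SPACE as a case where they agree.
        int[] samples = {0x0020, 0x0085, 0x00A0, 0x2007, 0x202F, 0x001C};
        for (int cp : samples) {
            System.out.printf("U+%04X java=%b unicode=%b%n",
                cp, Character.isWhitespace(cp), isUnicodeWhitespace(cp));
        }
    }
}
```

Under these assumptions, the no-break spaces (U+00A0, U+2007, U+202F) and NEL (U+0085) count as Unicode whitespace but not as Java whitespace, while the separator controls U+001C to U+001F are Java whitespace only. A tokenizer built on Lucene's CharTokenizer could presumably use such a predicate in its isTokenChar(int) override by returning the negated result, though the exact constructor and method signatures depend on the Lucene version.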

            People

              Assignee: Unassigned
              Reporter: Jörg Prante (jprante)
              Votes: 0
              Watchers: 2
