  1. Lucene - Core
  2. LUCENE-1689

supplementary character handling


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

    Description

      This is for Java 5. Java 5 is based on Unicode 4, which means a variable-width encoding: a supplementary character (a code point above U+FFFF) takes two chars, a surrogate pair.
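To make the variable-width point concrete, here is a minimal sketch using only java.lang APIs available since Java 5. The class name SupplementaryBasics and the choice of U+1D11E as the sample character are illustrative assumptions, not part of the issue.

```java
final class SupplementaryBasics {
    // U+1D11E MUSICAL SYMBOL G CLEF, spelled out as its UTF-16 surrogate pair
    // (high surrogate D834, low surrogate DD1E). Sample character chosen for illustration.
    static final String G_CLEF = "\uD834\uDD1E";

    /** Number of UTF-16 code units (Java chars) in s. */
    static int charLength(String s) {
        return s.length();
    }

    /** Number of Unicode code points in s. */
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    /** True if the char at index i starts a well-formed surrogate pair. */
    static boolean startsSurrogatePair(String s, int i) {
        return i + 1 < s.length()
                && Character.isHighSurrogate(s.charAt(i))
                && Character.isLowSurrogate(s.charAt(i + 1));
    }
}
```

Code that iterates over char or char[] sees two units for this one character, which is why per-char tests and conversions can misbehave on it.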

      Supplementary character support should be fixed for code that works with char/char[].

      For example:
      StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc. should at least be changed so they don't remove supplementary characters, or modified to look for surrogates and behave correctly.
      LowerCaseFilter should be modified to lowercase supplementary characters correctly.
      CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() take an int code point instead of a char.
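As an illustration of the lowercasing item, the following is a minimal sketch of in-place lowercasing by code point rather than by char. It is not the attached patch: the class name CodePointLowerCase is hypothetical, and only standard java.lang.Character methods are used. It assumes a lowercase mapping occupies the same number of chars as the original, and it leaves the text untouched where that does not hold.

```java
final class CodePointLowerCase {
    /** Lowercases buffer[0..length) in place, one code point at a time. */
    static void toLowerCase(char[] buffer, int length) {
        for (int i = 0; i < length; ) {
            // Reads one char, or two if a valid surrogate pair starts at i.
            int cp = Character.codePointAt(buffer, i, length);
            int lower = Character.toLowerCase(cp);
            int width = Character.charCount(cp);
            // Only write back when the mapping fits in the same number of chars
            // (an assumption of this sketch; otherwise the text is left as is).
            if (lower != cp && Character.charCount(lower) == width) {
                Character.toChars(lower, buffer, i);
            }
            i += width;
        }
    }
}
```

For example, DESERET CAPITAL LETTER LONG I (U+10400) lowercases to U+10428 this way, whereas calling Character.toLowerCase(char) on each half of the surrogate pair changes nothing, because surrogates have no case mapping.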

      In all of these cases the code should remain optimized for the BMP (Basic Multilingual Plane) case; supplementary characters should be the exception, but they should still work.
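One way to keep the BMP case cheap is to treat each char as a code point directly and decode a pair only when a high surrogate is seen. The sketch below shows that shape for a letter-based splitter. LetterSplitter, its tokenize method, and the int-based isTokenChar predicate are hypothetical names in the spirit of the proposal, not Lucene's actual CharTokenizer API.

```java
import java.util.ArrayList;
import java.util.List;

final class LetterSplitter {
    /** Hypothetical int-based predicate, modelled on the proposed isTokenChar(int). */
    static boolean isTokenChar(int codePoint) {
        return Character.isLetter(codePoint);
    }

    /** Splits text into maximal runs of token characters. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        final int n = text.length();
        int start = -1; // start offset of the current token, or -1 if none
        int i = 0;
        while (i < n) {
            char c = text.charAt(i);
            int cp = c;    // fast path: a BMP char is its own code point
            int width = 1;
            // Slow path, taken only for a well-formed surrogate pair.
            if (Character.isHighSurrogate(c) && i + 1 < n
                    && Character.isLowSurrogate(text.charAt(i + 1))) {
                cp = Character.toCodePoint(c, text.charAt(i + 1));
                width = 2;
            }
            if (isTokenChar(cp)) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                tokens.add(text.substring(start, i));
                start = -1;
            }
            i += width;
        }
        if (start >= 0) {
            tokens.add(text.substring(start));
        }
        return tokens;
    }
}
```

The extra cost for BMP-only text is a single isHighSurrogate range check per char. A char-based predicate would instead reject each surrogate half as a non-letter and silently drop the supplementary character from the token.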

      Attachments

        1. testCurrentBehavior.txt
          8 kB
          Robert Muir
        2. LUCENE-1689.patch
          7 kB
          Robert Muir
        3. LUCENE-1689.patch
          19 kB
          Robert Muir
        4. LUCENE-1689.patch
          52 kB
          Robert Muir
        5. LUCENE-1689_lowercase_example.txt
          1.0 kB
          Robert Muir


            People

              Assignee: Unassigned
              Reporter: Robert Muir (rcmuir)
