  1. Lucene - Core
  2. LUCENE-444

StandardTokenizer loses Korean characters


Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.9
    • Component/s: modules/analysis
    • Labels: None

    Description

      When StandardAnalyzer (specifically StandardTokenizer) is used on a Korean text stream, StandardTokenizer ignores the Korean characters. This is because the CJK token definition in the StandardTokenizer.jj JavaCC file does not include the range that covers the Korean syllables described in the Unicode character map.
      This patch adds one line to StandardTokenizer.jj for 0xAC00~0xD7AF, the Korean syllables range.
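
      The range check that the added line expresses can be sketched in plain Java. This is only an illustration of the 0xAC00~0xD7AF range, not the patch itself (the real change is a character range in the JavaCC grammar); the class and method names here are hypothetical.

```java
// Illustrative sketch only: HangulRangeSketch and its methods are hypothetical
// names, not part of Lucene. The bounds come from the range in the description.
public final class HangulRangeSketch {
    static final int HANGUL_START = 0xAC00; // first code point of the range
    static final int HANGUL_END = 0xD7AF;   // last code point of the range

    // True if the character falls inside the Korean syllables range.
    static boolean isHangulSyllable(char c) {
        return c >= HANGUL_START && c <= HANGUL_END;
    }

    // Counts the characters of the input that fall inside the range.
    static int countHangul(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (isHangulSyllable(text.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Three Korean syllables written as Unicode escapes to avoid
        // depending on the source file encoding.
        String sample = "Lucene \uD55C\uAD6D\uC5B4";
        System.out.println(countHangul(sample));
    }
}
```

      A tokenizer whose CJK definition omits this range has no token class that matches these characters, which is consistent with the behavior reported above.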

      Attachments

        1. StandardTokenizer_Korean.patch
          0.3 kB
          Cheolgoo Kang

        Activity

          People

            Assignee: Unassigned
            Reporter: Cheolgoo Kang (appler)
            Votes: 0
            Watchers: 0

            Dates

              Created:
              Updated:
              Resolved: