Lucene - Core
LUCENE-444

StandardTokenizer loses Korean characters


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.9
    • Component/s: modules/analysis
    • Labels: None

    Description

    While using StandardAnalyzer, and specifically StandardTokenizer, on a Korean text stream, StandardTokenizer silently drops the Korean characters. This happens because the CJK token definition in the StandardTokenizer.jj JavaCC grammar does not cover a wide enough range: it misses the Hangul Syllables block (U+AC00~U+D7AF) of the Unicode character map.
    This patch adds one line to StandardTokenizer.jj, extending the CJK token definition with the Korean syllables range 0xAC00~0xD7AF.
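    For context, here is a minimal reproduction sketch against the Lucene 1.x-era analysis API. The field name "body" and the sample string are illustrative, not taken from the original report:

        import java.io.StringReader;

        import org.apache.lucene.analysis.Token;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.standard.StandardAnalyzer;

        public class KoreanTokenizationRepro {
            public static void main(String[] args) throws Exception {
                // "한국어" consists of precomposed Hangul syllables,
                // which live in the U+AC00..U+D7AF Unicode block.
                String text = "hello 한국어 world";

                TokenStream stream = new StandardAnalyzer()
                        .tokenStream("body", new StringReader(text));

                // With the unpatched grammar, the Hangul run matches no
                // token rule and is skipped, so only "hello" and "world"
                // are printed.
                for (Token tok = stream.next(); tok != null; tok = stream.next()) {
                    System.out.println(tok.termText());
                }
            }
        }

    With the patched grammar in 1.9, the Hangul syllables fall inside the CJK token definition, so the Korean text is tokenized rather than discarded.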

    Attachments

    Activity

    People

    • Assignee: Unassigned
    • Reporter: Cheolgoo Kang (appler)

    Dates

    • Created:
    • Updated:
    • Resolved:
