Lucene - Core
LUCENE-461

StandardTokenizer splits Korean words into separate characters

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.9
    • Component/s: modules/analysis
    • Labels: None
    • Environment: Analyzing Korean text with Apache Lucene, esp. with StandardAnalyzer.

    Description

      StandardTokenizer splits Korean words into separate single-character tokens. For example, "안녕하세요" is one Korean word meaning "Hello", but StandardAnalyzer separates it into the five tokens "안", "녕", "하", "세", "요".
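      The difference between the reported behavior and the desired one can be sketched outside Lucene. The class and method names below are illustrative, not Lucene's actual tokenizer code: one mode emits each Hangul syllable as its own token (the bug), the other keeps a contiguous Hangul run together as a single word token (what the patch aims for).

```java
import java.util.ArrayList;
import java.util.List;

public class HangulTokenizerSketch {

    // True for precomposed Hangul syllables (U+AC00 to U+D7A3).
    static boolean isHangul(char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.HANGUL_SYLLABLES;
    }

    // splitHangulPerSyllable = true mimics the reported bug (one token per
    // syllable); false mimics the patched behavior (one token per contiguous
    // Hangul run, treated like any other word).
    static List<String> tokenize(String text, boolean splitHangulPerSyllable) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                if (splitHangulPerSyllable && isHangul(c)) {
                    if (current.length() > 0) {
                        tokens.add(current.toString());
                        current.setLength(0);
                    }
                    tokens.add(String.valueOf(c)); // each syllable on its own
                } else {
                    current.append(c); // extend the current word
                }
            } else if (current.length() > 0) {
                tokens.add(current.toString()); // delimiter ends the word
                current.setLength(0);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        // Buggy mode: "안녕하세요" shatters into five tokens.
        System.out.println(tokenize("안녕하세요 Lucene", true));
        // Patched mode: one token per word.
        System.out.println(tokenize("안녕하세요 Lucene", false));
    }
}
```

      The key point is that Hangul syllables satisfy Character.isLetterOrDigit, so grouping consecutive letters already yields whole-word tokens; the per-syllable splitting has to be special-cased, which is why removing that special case fixes the issue.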

      Attachments

        1. StandardTokenizer_KoreanWord.patch
          1 kB
          Cheolgoo Kang
        2. TestStandardAnalyzer_KoreanWord.patch
          0.4 kB
          Cheolgoo Kang

        Activity

          People

            Assignee: Unassigned
            Reporter: Cheolgoo Kang (appler)
            Votes: 0
            Watchers: 0

            Dates

              Created:
              Updated:
              Resolved: