Lucene - Core / LUCENE-461

StandardTokenizer splitting all of Korean words into separate characters


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.9
    • Component/s: modules/analysis
    • Labels:
      None
    • Environment:

      Analyzing Korean text with Apache Lucene, especially with StandardAnalyzer.

    Description

      StandardTokenizer splits Korean words into separate single-character tokens. For example, "안녕하세요" is one Korean word meaning "Hello", but StandardAnalyzer separates it into the five tokens "안", "녕", "하", "세", "요".
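      The behavior described above can be sketched in plain Java without a Lucene dependency. The class and method names below are hypothetical; the loop only mimics (it does not reuse) the tokenizer's one-token-per-character handling of Hangul syllables:

      ```java
      import java.util.ArrayList;
      import java.util.List;

      public class KoreanSplitDemo {
          // Emit one token per Hangul syllable, mimicking how the pre-fix
          // StandardTokenizer broke a Korean word into single characters.
          static List<String> perCharacterTokens(String text) {
              List<String> tokens = new ArrayList<>();
              for (int i = 0; i < text.length(); i++) {
                  char c = text.charAt(i);
                  // Hangul syllables occupy the Unicode block U+AC00..U+D7A3.
                  if (Character.UnicodeBlock.of(c) == Character.UnicodeBlock.HANGUL_SYLLABLES) {
                      tokens.add(String.valueOf(c));
                  }
              }
              return tokens;
          }

          public static void main(String[] args) {
              // "안녕하세요" ("Hello") is one word, yet five tokens come out.
              System.out.println(perCharacterTokens("안녕하세요")); // [안, 녕, 하, 세, 요]
          }
      }
      ```

      The attached patches change the tokenizer grammar so the whole word is kept as a single token instead.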

        Attachments

        1. StandardTokenizer_KoreanWord.patch
          1 kB
          Cheolgoo Kang
        2. TestStandardAnalyzer_KoreanWord.patch
          0.4 kB
          Cheolgoo Kang


            People

            • Assignee:
              Unassigned
            • Reporter:
              Cheolgoo Kang (appler)
            • Votes:
              0
            • Watchers:
              0
