  1. Lucene - Core
  2. LUCENE-461

StandardTokenizer splitting all of Korean words into separate characters

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.9
    • Component/s: modules/analysis
    • Labels:
      None
    • Environment:

      Analyzing Korean text with Apache Lucene, esp. with StandardAnalyzer.

      Description

      StandardTokenizer splits every Korean word into separate single-character tokens. For example, "안녕하세요" is one Korean word that means "Hello", but StandardAnalyzer separates it into five tokens: "안", "녕", "하", "세", "요".
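
      To make the reported behavior concrete, here is a minimal standalone sketch that contrasts per-character splitting with keeping a run of Hangul syllables together. It does not use Lucene and is not the StandardTokenizer grammar or the attached patch; the class and method names (HangulTokenSketch, splitPerCharacter, groupHangul) are hypothetical, the sample input is assumed to be the five-syllable greeting 안녕하세요, and the only outside fact relied on is that precomposed Hangul syllables occupy U+AC00 to U+D7A3.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: hypothetical names, no Lucene dependency.
public class HangulTokenSketch {

    // Precomposed Hangul syllables occupy U+AC00..U+D7A3 in Unicode.
    public static boolean isHangulSyllable(char c) {
        return c >= '\uAC00' && c <= '\uD7A3';
    }

    // Reported behavior: every Hangul syllable becomes its own token.
    public static List<String> splitPerCharacter(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isHangulSyllable(c)) {
                tokens.add(String.valueOf(c));
            }
        }
        return tokens;
    }

    // Desired behavior: a run of consecutive Hangul syllables is one token.
    // Any non-Hangul character simply ends the current run in this sketch.
    public static List<String> groupHangul(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isHangulSyllable(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Assumed sample input: the greeting written with Unicode escapes.
        String hello = "\uC548\uB155\uD558\uC138\uC694";
        System.out.println(splitPerCharacter(hello)); // five one-character tokens
        System.out.println(groupHangul(hello));       // a single five-character token
    }
}
```

      Running main prints the two token lists side by side; the first mirrors the behavior described in this report, the second the behavior a Korean-aware tokenizer would be expected to produce.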

        Activity

        appler Cheolgoo Kang added a comment -

        Here are patches that keep a Korean word intact instead of splitting it into individual characters. The attached TestStandardAnalyzer test case passes with the patched StandardTokenizer.

        ehatcher Erik Hatcher added a comment -

        These patches have been applied, thanks!

        One thing to note: the emitted token type changes from "<CJK>" to "<CJ>". Some people may have written code that relies on the old type, but this token type is brittle anyway because it is derived from the JavaCC grammar definition. I consider this an acceptable break in full backwards compatibility, since it is unlikely that anyone is using that token type.


          People

          • Assignee:
            Unassigned
            Reporter:
            appler Cheolgoo Kang