Lucene - Core: LUCENE-444

StandardTokenizer loses Korean characters

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.9
    • Component/s: modules/analysis
    • Labels:
      None

      Description

      While using StandardAnalyzer, especially StandardTokenizer, on a Korean text stream, StandardTokenizer ignores the Korean characters. This is because the CJK token definition in the StandardTokenizer.jj JavaCC file does not include a range covering the Korean syllables defined in the Unicode character map.
      This patch adds one line to the StandardTokenizer.jj code for the Korean syllables range, 0xAC00~0xD7AF.
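
      To illustrate which characters the added range covers, here is a minimal, self-contained Java sketch. It does not use Lucene and is not the patch itself; the class and method names (HangulRangeCheck, isHangulSyllable, countHangulSyllables) are hypothetical and chosen only for this example. It simply checks whether a character falls inside 0xAC00~0xD7AF, the range quoted above.

```java
// Illustrative sketch only: HangulRangeCheck and its methods are hypothetical
// names for this example and are not part of Lucene or of the attached patch.
class HangulRangeCheck {
    // Bounds of the Korean syllables range quoted in the issue: 0xAC00~0xD7AF.
    static final char HANGUL_START = '\uAC00';
    static final char HANGUL_END = '\uD7AF';

    // True when the character lies inside the Korean syllables range.
    static boolean isHangulSyllable(char c) {
        return c >= HANGUL_START && c <= HANGUL_END;
    }

    // Counts how many characters of the text lie inside the range.
    static int countHangulSyllables(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (isHangulSyllable(text.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Three Korean syllables followed by a Latin word.
        String text = "\uD55C\uAD6D\uC5B4 Lucene";
        System.out.println(countHangulSyllables(text)); // prints 3
    }
}
```

      A tokenizer whose CJK character class omits this range has no rule that matches such characters, which is consistent with the behavior reported here.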

        Activity

        Cheolgoo Kang created issue -
        Cheolgoo Kang added a comment -

        This patch adds one line of 0xAC00~0xD7AF, the Korean syllables range to the StandardTokenizer.jj code.

        Cheolgoo Kang made changes -
        Field Original Value New Value
        Attachment StandardTokenizer_Korean.patch [ 12314710 ]
        Otis Gospodnetic added a comment -

        Committed. Thanks Cheolgoo.

        Otis Gospodnetic made changes -
        Fix Version/s 1.9 [ 12310334 ]
        Resolution Fixed [ 1 ]
        Status Open [ 1 ] Resolved [ 5 ]
        Erik Hatcher added a comment -

        I'm closing this issue... but some unit tests would be nice to go along with this too, eventually

        Erik Hatcher made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Mark Thomas made changes -
        Workflow jira [ 12330777 ] Default workflow, editable Closed status [ 12562188 ]
        Mark Thomas made changes -
        Workflow Default workflow, editable Closed status [ 12562188 ] jira [ 12583200 ]
        Transition            Time In Source Status   Execution Times   Last Executer      Last Execution Date
        Open -> Resolved      13h 28m                 1                 Otis Gospodnetic   05/Oct/05 13:54
        Resolved -> Closed    6h 46m                  1                 Erik Hatcher       05/Oct/05 20:40

          People

          • Assignee:
            Unassigned
            Reporter:
            Cheolgoo Kang
          • Votes:
            0
            Watchers:
            0
