Lucene - Core · LUCENE-8526

StandardTokenizer doesn't separate hangul characters from other non-CJK chars

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Not A Bug
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      It was first reported here: https://github.com/elastic/elasticsearch/issues/34285.
      I don't know if it's the expected behavior, but the StandardTokenizer does not split words that are composed of a mix of non-CJK characters and Hangul syllables. For instance, "한국2018" or "한국abc" is kept as-is by this tokenizer and marked as an alphanumeric group. This breaks the CJKBigram token filter, which will not build bigrams on such groups. Other CJK characters are correctly split when they are mixed with another alphabet, so I'd expect the same for Hangul.
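To make the expected behavior concrete, here is a minimal Python sketch (not Lucene code) of the split the reporter expected: breaking a token into Hangul and non-Hangul runs. The regex and the `split_hangul` helper are illustrative assumptions, not part of StandardTokenizer.

```python
import re

# Illustrative helper (not a Lucene API): split a token into
# alternating runs of precomposed Hangul syllables (U+AC00..U+D7A3)
# and everything else.
def split_hangul(token):
    return re.findall(r'[\uAC00-\uD7A3]+|[^\uAC00-\uD7A3]+', token)

print(split_hangul("한국2018"))  # → ['한국', '2018']
print(split_hangul("한국abc"))   # → ['한국', 'abc']
```

Under this splitting, "한국" would be emitted as a Hangul token that the CJKBigram filter could process, while StandardTokenizer instead keeps "한국2018" as a single alphanumeric token.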


              People

              • Assignee: Unassigned
              • Reporter: Jim Ferenczi (jim.ferenczi)
              • Votes: 0
              • Watchers: 2
