  1. Lucene - Core
  2. LUCENE-1490

CJKTokenizer converts HALFWIDTH_AND_FULLWIDTH_FORMS incorrectly

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • None
    • 2.4, 2.9
    • None
    • None
    • New, Patch Available

    Description

      CJKTokenizer contains these lines:

      if (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS) {
        /** convert HALFWIDTH_AND_FULLWIDTH_FORMS to BASIC_LATIN */
        int i = (int) c;
        i = i - 65248;
        c = (char) i;
      }

      This is wrong. Some characters in the block (e.g. U+FF68) have no BASIC_LATIN counterparts.
      Only code points 65281-65374 (U+FF01 to U+FF5E) can be converted this way.

      The fix is to apply the offset only to code points in the range 65281-65374:

      if (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS) {
        int i = (int) c;
        if (i >= 65281 && i <= 65374) {
          /** convert HALFWIDTH_AND_FULLWIDTH_FORMS to BASIC_LATIN */
          i = i - 65248;
          c = (char) i;
        }
      }
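The range check can be tried in isolation with a small standalone sketch. The class name `FullwidthFold` and the method `fold` are hypothetical and are not part of CJKTokenizer; the sketch only assumes the offset 65248 (0xFEE0) and the range 65281-65374 stated in this report.

```java
public class FullwidthFold {
    // Fullwidth ASCII variants occupy U+FF01..U+FF5E (65281..65374).
    private static final int FULLWIDTH_START = 0xFF01;
    private static final int FULLWIDTH_END = 0xFF5E;
    // Distance between a fullwidth variant and its BASIC_LATIN counterpart (65248).
    private static final int OFFSET = 0xFEE0;

    /** Maps a fullwidth ASCII variant to BASIC_LATIN; returns any other char unchanged. */
    static char fold(char c) {
        if (Character.UnicodeBlock.of(c) == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
                && c >= FULLWIDTH_START && c <= FULLWIDTH_END) {
            return (char) (c - OFFSET);
        }
        return c;
    }

    public static void main(String[] args) {
        // U+FF21 (fullwidth A) is inside the convertible range.
        System.out.println((int) fold('\uFF21'));
        // U+FF68 is in the same Unicode block but outside the range, so it is left alone.
        System.out.println(Integer.toHexString(fold('\uFF68')));
    }
}
```

Without the range check, U+FF68 minus 65248 gives U+0088, a control character rather than a Latin letter, which is the behavior this issue describes.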

          People

            Assignee: Michael McCandless (mikemccand)
            Reporter: Daniel Cheng (sdiz)