  1. Lucene - Core
  2. LUCENE-1490

CJKTokenizer converts HALFWIDTH_AND_FULLWIDTH_FORMS incorrectly


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.4, 2.9
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      CJKTokenizer contains these lines:

      if (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS) {
          /** convert HALFWIDTH_AND_FULLWIDTH_FORMS to BASIC_LATIN */
          int i = (int) c;
          i = i - 65248;
          c = (char) i;
      }

      This is wrong. Some characters in the block (e.g. U+FF68) have no BASIC_LATIN counterparts.
      Only code points 65281-65374 (U+FF01 to U+FF5E) can be converted this way.

      The fix is:

      if (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS && c >= 65281 && c <= 65374) {
          /** convert the fullwidth ASCII variants in HALFWIDTH_AND_FULLWIDTH_FORMS to BASIC_LATIN */
          int i = (int) c;
          i = i - 65248;
          c = (char) i;
      }
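
The range check can be tried in isolation with a small standalone sketch. The class name `FullwidthFold` and the method `fold` are hypothetical names chosen for illustration and are not part of CJKTokenizer; the constants are the ones quoted in this report. With the unguarded subtraction, U+FF68 (65384) would become 136 (U+0088), a control character, whereas the guarded version leaves it unchanged.

```java
public class FullwidthFold {
    // Distance between a fullwidth ASCII variant and its BASIC_LATIN counterpart.
    static final int OFFSET = 65248;
    // First and last code points that have a BASIC_LATIN counterpart.
    static final int FIRST = 65281; // U+FF01 FULLWIDTH EXCLAMATION MARK
    static final int LAST = 65374;  // U+FF5E FULLWIDTH TILDE

    /** Hypothetical helper: folds fullwidth ASCII variants, leaves everything else as-is. */
    public static char fold(char c) {
        if (c >= FIRST && c <= LAST) {
            return (char) (c - OFFSET);
        }
        return c;
    }

    public static void main(String[] args) {
        System.out.println(fold('\uFF21'));       // fullwidth 'A', prints "A"
        System.out.println((int) fold('\uFF68')); // halfwidth katakana, prints 65384 (unchanged)
    }
}
```

Checking the range directly on the character value makes the block test redundant in this sketch, since 65281-65374 lies entirely inside HALFWIDTH_AND_FULLWIDTH_FORMS; the tokenizer itself may still need the block lookup for its other branches.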

            People

            • Assignee: Michael McCandless (mikemccand)
            • Reporter: Daniel Cheng (sdiz)