LUCENE-1032 (Lucene - Core)

CJKAnalyzer should convert half width katakana to full width katakana


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

    Description

      Some of our Japanese customers report errors when searching with half-width characters.
      The desired behavior is that a document containing half-width characters is returned both when searching with the full-width equivalents and when searching with the half-width characters themselves.
      Currently, a search for half-width characters returns no matches.

      Here is a test case outlining desired behavior (this may require a new Analyzer).

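      import java.io.IOException;
      import java.io.Reader;
      import java.io.StringReader;
      import java.io.UnsupportedEncodingException;

      import junit.framework.TestCase;

      import org.apache.lucene.analysis.Token;
      import org.apache.lucene.analysis.TokenStream;
      import org.apache.lucene.analysis.cjk.CJKAnalyzer;

      public class TestJapaneseEncodings extends TestCase
      {

          // UTF-8 bytes of U+30AB (full-width katakana KA)
          byte[] fullWidthKa = new byte[]{(byte) 0xE3, (byte) 0x82, (byte) 0xAB};
          // UTF-8 bytes of U+FF76 (half-width katakana KA)
          byte[] halfWidthKa = new byte[]{(byte) 0xEF, (byte) 0xBD, (byte) 0xB6};

          // Desired: half-width input is emitted as the full-width term.
          public void testAnalyzerWithHalfWidth() throws IOException
          {
              Reader r1 = new StringReader(makeHalfWidthKa());
              TokenStream stream = new CJKAnalyzer().tokenStream("foo", r1);
              assertNotNull(stream);
              Token token = stream.next();
              assertNotNull(token);
              assertEquals(makeFullWidthKa(), token.termText());
          }

          // Full-width input should pass through unchanged.
          public void testAnalyzerWithFullWidth() throws IOException
          {
              Reader r1 = new StringReader(makeFullWidthKa());
              TokenStream stream = new CJKAnalyzer().tokenStream("foo", r1);
              assertEquals(makeFullWidthKa(), stream.next().termText());
          }

          private String makeFullWidthKa() throws UnsupportedEncodingException
          {
              return new String(fullWidthKa, "UTF-8");
          }

          private String makeHalfWidthKa() throws UnsupportedEncodingException
          {
              return new String(halfWidthKa, "UTF-8");
          }
      }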
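
      The folding the test expects can be illustrated outside Lucene. The following is only a sketch, not the fix adopted for this issue: it assumes Java 6 or later (for java.text.Normalizer), and the class name HalfWidthFolding and the method toFullWidth are hypothetical. It relies on Unicode NFKC normalization, which maps half-width katakana to their full-width equivalents.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration; not part of Lucene.
public class HalfWidthFolding {

    // Applies Unicode NFKC normalization, which replaces half-width
    // katakana (U+FF66..U+FF9F) with their full-width equivalents.
    public static String toFullWidth(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String halfWidthKa = "\uFF76"; // half-width katakana KA
        String fullWidthKa = "\u30AB"; // full-width katakana KA

        // Half-width input folds to the full-width form.
        System.out.println(toFullWidth(halfWidthKa).equals(fullWidthKa)); // prints true

        // Full-width input is left unchanged.
        System.out.println(toFullWidth(fullWidthKa).equals(fullWidthKa)); // prints true

        // Half-width KA followed by the half-width voiced sound mark
        // composes to the single full-width character GA (U+30AC).
        System.out.println(toFullWidth("\uFF76\uFF9E").equals("\u30AC")); // prints true
    }
}
```

      Note that NFKC also folds other compatibility characters, for example full-width Latin letters to ASCII, so an analyzer that should touch only katakana would need a narrower mapping.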
      
      

          People

            Assignee: Unassigned
            Reporter: Andrew Lynch (alynch)
            Votes: 0
            Watchers: 1
