Key: LUCENE-1003
Type: Bug
Status: Resolved
Resolution: Fixed
Priority: Major
Assignee: Otis Gospodnetic
Reporter: OpenTeam.ru
Votes: 1
Watchers: 0

Lucene - Java

[PATCH] RussianAnalyzer's tokenizer skips numbers from input text

Created: 19/Sep/07 05:00 AM   Updated: 14/May/08 05:38 AM
Component/s: Analysis
Affects Version/s: 2.2
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
  RussianCharsets.java.patch (1 kB), 2007-09-19 07:54 AM, OpenTeam.ru, licensed for inclusion in ASF works
  TestRussianAnalyzer.java.patch (0.7 kB), 2008-02-19 04:04 AM, Dmitry Lihachev, licensed for inclusion in ASF works

Lucene Fields: Patch Available, New
Resolution Date: 14/May/08 05:38 AM


Description
RussianAnalyzer's tokenizer skips numbers in the input text, so the resulting token stream is missing them. The problem can be solved by adding the digits to RussianCharsets.UnicodeRussian. See the test case below for details.
TestRussianAnalyzer.java
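import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import junit.framework.TestCase;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ru.RussianAnalyzer;
import org.apache.lucene.analysis.ru.RussianCharsets;

public class TestRussianAnalyzer extends TestCase {

  // Input with one word followed by one number; both should become tokens.
  Reader reader = new StringReader("text 1000");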

  // FAILS: with the default charset, "1000" is skipped and the second next() returns null
  public void testStemmer() {
    testAnalyzer(new RussianAnalyzer());
  }

  // PASSES: with the digits added to the charset, "1000" is kept as a token
  public void testFixedRussianAnalyzer() {
    testAnalyzer(new RussianAnalyzer(getRussianCharSet()));
  }

  private void testAnalyzer(RussianAnalyzer analyzer) {
    try {
      TokenStream stream = analyzer.tokenStream("text", reader);
      assertEquals("text", stream.next().termText());
      assertNotNull(stream.next());
    } catch (IOException e) {
      fail(e.getMessage());
    }
  }

  // Copies RussianCharsets.UnicodeRussian and appends the ten ASCII digits.
  private char[] getRussianCharSet() {
    int length = RussianCharsets.UnicodeRussian.length;
    final char[] russianChars = new char[length + 10];

    System.arraycopy(RussianCharsets.UnicodeRussian, 0, russianChars, 0, length);
    for (char digit = '0'; digit <= '9'; digit++) {
      russianChars[length++] = digit;
    }
    return russianChars;
  }
}
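
The effect described in this report can be illustrated without Lucene on the classpath. The sketch below is a simplified model of a charset-driven tokenizer, not Lucene's actual tokenizer code: the class name CharsetTokenizerSketch, the methods tokenize and withDigits, and the use of lowercase Latin letters as a stand-in for RussianCharsets.UnicodeRussian are all assumptions made for illustration. It only shows why a character that is absent from the allowed charset never reaches the token stream, and how appending the digits changes that.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, Lucene-free model of a tokenizer that keeps only characters
// found in an allowed charset. Names here are illustrative assumptions.
public class CharsetTokenizerSketch {

  // Splits text into maximal runs of characters that appear in charset.
  // Any other character acts as a separator and is discarded.
  static List<String> tokenize(String text, char[] charset) {
    List<String> tokens = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (char c : text.toCharArray()) {
      if (contains(charset, c)) {
        current.append(c);
      } else if (current.length() > 0) {
        tokens.add(current.toString());
        current.setLength(0);
      }
    }
    if (current.length() > 0) {
      tokens.add(current.toString());
    }
    return tokens;
  }

  // Linear membership test over the charset array.
  static boolean contains(char[] charset, char c) {
    for (char allowed : charset) {
      if (allowed == c) {
        return true;
      }
    }
    return false;
  }

  // Returns a copy of base with the ten ASCII digits appended,
  // mirroring what getRussianCharSet() does in the test case above.
  static char[] withDigits(char[] base) {
    char[] extended = new char[base.length + 10];
    System.arraycopy(base, 0, extended, 0, base.length);
    int next = base.length;
    for (char digit = '0'; digit <= '9'; digit++) {
      extended[next++] = digit;
    }
    return extended;
  }

  public static void main(String[] args) {
    // Stand-in charset (assumption): lowercase Latin letters only.
    char[] letters = "abcdefghijklmnopqrstuvwxyz".toCharArray();
    System.out.println(tokenize("text 1000", letters));             // prints [text]
    System.out.println(tokenize("text 1000", withDigits(letters))); // prints [text, 1000]
  }
}
```

With the letters-only charset the number is treated as separator material and dropped, which corresponds to the failing assertNotNull in testStemmer; with the extended charset it survives as its own token.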

