Issue Details (XML | Word | Printable)

Key: LUCENE-1003
Type: Bug Bug
Status: Resolved Resolved
Resolution: Fixed
Priority: Major Major
Assignee: Otis Gospodnetic
Reporter: OpenTeam.ru
Votes: 1
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Lucene - Java

[PATCH] RussianAnalyzer's tokenizer skips numbers from input text,

Created: 19/Sep/07 05:00 AM   Updated: 14/May/08 05:38 AM
Return to search
Component/s: Analysis
Affects Version/s: 2.2
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works RussianCharsets.java.patch 2007-09-19 07:54 AM OpenTeam.ru 1 kB
Text File Licensed for inclusion in ASF works TestRussianAnalyzer.java.patch 2008-02-19 04:04 AM Dmitry Lihachev 0.7 kB

Lucene Fields: Patch Available, New
Resolution Date: 14/May/08 05:38 AM


 Description  « Hide
RussianAnalyzer's tokenizer skips numbers from input text, so that resulting token stream miss numbers. Problem can be solved by adding numbers to RussianCharsets.UnicodeRussian. See test case below for details.
TestRussianAnalyzer.java
public class TestRussianAnalyzer extends TestCase {

  Reader reader = new StringReader("text 1000");

  // test FAILS
  public void testStemmer() {
    testAnalyzer(new RussianAnalyzer());
  }

  // test PASSES
  public void testFixedRussianAnalyzer() {
    testAnalyzer(new RussianAnalyzer(getRussianCharSet()));
  }

  private void testAnalyzer(RussianAnalyzer analyzer) {
    try {
      TokenStream stream = analyzer.tokenStream("text", reader);
      assertEquals("text", stream.next().termText());
      assertNotNull(stream.next());
    } catch (IOException e) {
      fail(e.getMessage());
    }
  }

  private char[] getRussianCharSet() {
    int length = RussianCharsets.UnicodeRussian.length;
    final char[] russianChars = new char[length + 10];

    System
        .arraycopy(RussianCharsets.UnicodeRussian, 0, russianChars, 0, length);
    russianChars[length++] = '0';
    russianChars[length++] = '1';
    russianChars[length++] = '2';
    russianChars[length++] = '3';
    russianChars[length++] = '4';
    russianChars[length++] = '5';
    russianChars[length++] = '6';
    russianChars[length++] = '7';
    russianChars[length++] = '8';
    russianChars[length] = '9';
    return russianChars;
  }
}


 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
OpenTeam.ru made changes - 19/Sep/07 05:02 AM
Field Original Value New Value
Description RussianAnalyzer's tokenizer skips numbers from input text, so that resulting token stream miss numbers. Problem can be solved by adding numbers to RussianCharsets.UnicodeRussian. See test case below for details.

{code:title=TestRussianAnalyzer.java|borderStyle=solid}

public class TestRussianAnalyzer extends TestCase {

  Reader reader = new StringReader("text 1000");

  public void testStemmer() {
    testAnalyzer(new RussianAnalyzer());
  }

  public void testFixedRussianAnalyzer() {
    testAnalyzer(new RussianAnalyzer(getRussianCharSet()));
  }

  private void testAnalyzer(RussianAnalyzer analyzer) {
    try {
      TokenStream stream = analyzer.tokenStream("text", reader);
      assertEquals("text", stream.next().termText());
      assertNotNull(stream.next());
    } catch (IOException e) {
      fail(e.getMessage());
    }
  }

  private char[] getRussianCharSet() {
    int length = RussianCharsets.UnicodeRussian.length;
    final char[] russianChars = new char[length + 10];

    System
        .arraycopy(RussianCharsets.UnicodeRussian, 0, russianChars, 0, length);
    russianChars[length++] = '0';
    russianChars[length++] = '1';
    russianChars[length++] = '2';
    russianChars[length++] = '3';
    russianChars[length++] = '4';
    russianChars[length++] = '5';
    russianChars[length++] = '6';
    russianChars[length++] = '7';
    russianChars[length++] = '8';
    russianChars[length] = '9';
    return russianChars;
  }
}

{code}
RussianAnalyzer's tokenizer skips numbers from input text, so that resulting token stream miss numbers. Problem can be solved by adding numbers to RussianCharsets.UnicodeRussian. See test case below for details.

{code:title=TestRussianAnalyzer.java|borderStyle=solid}

public class TestRussianAnalyzer extends TestCase {

  Reader reader = new StringReader("text 1000");

  // test FAILS
  public void testStemmer() {
    testAnalyzer(new RussianAnalyzer());
  }

  // test PASSES
  public void testFixedRussianAnalyzer() {
    testAnalyzer(new RussianAnalyzer(getRussianCharSet()));
  }

  private void testAnalyzer(RussianAnalyzer analyzer) {
    try {
      TokenStream stream = analyzer.tokenStream("text", reader);
      assertEquals("text", stream.next().termText());
      assertNotNull(stream.next());
    } catch (IOException e) {
      fail(e.getMessage());
    }
  }

  private char[] getRussianCharSet() {
    int length = RussianCharsets.UnicodeRussian.length;
    final char[] russianChars = new char[length + 10];

    System
        .arraycopy(RussianCharsets.UnicodeRussian, 0, russianChars, 0, length);
    russianChars[length++] = '0';
    russianChars[length++] = '1';
    russianChars[length++] = '2';
    russianChars[length++] = '3';
    russianChars[length++] = '4';
    russianChars[length++] = '5';
    russianChars[length++] = '6';
    russianChars[length++] = '7';
    russianChars[length++] = '8';
    russianChars[length] = '9';
    return russianChars;
  }
}

{code}
OpenTeam.ru made changes - 19/Sep/07 07:54 AM
Attachment RussianCharsets.java.patch [ 12366159 ]
OpenTeam.ru made changes - 19/Sep/07 07:57 AM
Summary RussianAnalyzer's tokenizer skips numbers from input text, [PATCH] RussianAnalyzer's tokenizer skips numbers from input text,
Otis Gospodnetic made changes - 17/Feb/08 08:22 AM
Lucene Fields [New] [New, Patch Available]
Assignee Otis Gospodnetic [ otis ]
Dmitry Lihachev made changes - 19/Feb/08 04:04 AM
Attachment TestRussianAnalyzer.java.patch [ 12375880 ]
Otis Gospodnetic made changes - 14/May/08 05:38 AM
Status Open [ 1 ] Resolved [ 5 ]
Resolution Fixed [ 1 ]