
Key: LUCENE-1003
Type: Bug
Status: Resolved
Resolution: Fixed
Priority: Major
Assignee: Otis Gospodnetic
Reporter: OpenTeam.ru
Votes: 1
Watchers: 0
Lucene - Java

[PATCH] RussianAnalyzer's tokenizer skips numbers from input text

Created: 19/Sep/07 05:00 AM   Updated: 14/May/08 05:38 AM
Component/s: Analysis
Affects Version/s: 2.2
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments (both licensed for inclusion in ASF works):
RussianCharsets.java.patch, 1 kB, attached 2007-09-19 07:54 AM by OpenTeam.ru
TestRussianAnalyzer.java.patch, 0.7 kB, attached 2008-02-19 04:04 AM by Dmitry Lihachev

Lucene Fields: Patch Available, New
Resolution Date: 14/May/08 05:38 AM


Description
RussianAnalyzer's tokenizer skips numbers in the input text, so the resulting token stream is missing them. The problem can be solved by adding the digits to RussianCharsets.UnicodeRussian. See the test case below for details.
TestRussianAnalyzer.java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import junit.framework.TestCase;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ru.RussianAnalyzer;
import org.apache.lucene.analysis.ru.RussianCharsets;

public class TestRussianAnalyzer extends TestCase {

  // Input with one word and one number; both should come out as tokens.
  Reader reader = new StringReader("text 1000");

  // test FAILS: the default charset has no digits, so "1000" is dropped
  public void testStemmer() {
    testAnalyzer(new RussianAnalyzer());
  }

  // test PASSES: the charset below includes the digits 0-9
  public void testFixedRussianAnalyzer() {
    testAnalyzer(new RussianAnalyzer(getRussianCharSet()));
  }

  // Expects "text" as the first token and a second, non-null token after it.
  private void testAnalyzer(RussianAnalyzer analyzer) {
    try {
      TokenStream stream = analyzer.tokenStream("text", reader);
      assertEquals("text", stream.next().termText());
      assertNotNull(stream.next());
    } catch (IOException e) {
      fail(e.getMessage());
    }
  }

  // Returns a copy of RussianCharsets.UnicodeRussian with '0'..'9' appended.
  private char[] getRussianCharSet() {
    int length = RussianCharsets.UnicodeRussian.length;
    final char[] russianChars = new char[length + 10];

    System.arraycopy(RussianCharsets.UnicodeRussian, 0, russianChars, 0, length);
    for (char digit = '0'; digit <= '9'; digit++) {
      russianChars[length++] = digit;
    }
    return russianChars;
  }
}
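
For readers without a Lucene checkout, the effect described above can be reproduced with a minimal, self-contained sketch. This is not Lucene's tokenizer: the class CharsetTokenizerSketch, its methods, and the Latin-letter stand-in charset are all hypothetical, chosen only to show how a tokenizer that keeps just the characters found in an allowed set silently drops "1000", and how appending the digits to that set brings the number back.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
public class CharsetTokenizerSketch {

    // Returns a copy of base with the ASCII digits '0'..'9' appended.
    static char[] withDigits(char[] base) {
        char[] result = Arrays.copyOf(base, base.length + 10);
        for (int i = 0; i < 10; i++) {
            result[base.length + i] = (char) ('0' + i);
        }
        return result;
    }

    // True if c occurs in the allowed character set.
    static boolean isTokenChar(char c, char[] allowed) {
        for (char a : allowed) {
            if (a == c) {
                return true;
            }
        }
        return false;
    }

    // Splits input into maximal runs of allowed characters;
    // every other character acts as a separator and is discarded.
    static List<String> tokenize(String input, char[] allowed) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (isTokenChar(c, allowed)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Stand-in for a letters-only charset (assumption for the demo).
        char[] letters = "abcdefghijklmnopqrstuvwxyz".toCharArray();
        System.out.println(tokenize("text 1000", letters));             // prints [text]
        System.out.println(tokenize("text 1000", withDigits(letters))); // prints [text, 1000]
    }
}
```

With the letters-only set the digits are treated as separators, which matches the failing test above; with the extended set the number survives as its own token.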


Comments
Nick Menere added a comment - 19/Sep/07 05:19 AM
Yeah,
I raised this on the dev list a few months ago and didn't get much response.

I think I might even be responsible for that code above. It was meant more as a hack to get a customer up and running.

Cheers,
Nick


OpenTeam.ru added a comment - 19/Sep/07 07:50 AM
Yeah, Nick, the code above was taken from your JIRA issue. We weren't able to find a similar issue in the Lucene issue tracker. We use Lucene a lot, so we needed this bug fixed in the core.

OpenTeam.ru added a comment - 19/Sep/07 07:54 AM
Patch that adds numbers to RussianCharsets
Usage: patch RussianCharsets.java < RussianCharsets.java.patch

Grant Ingersoll added a comment - 19/Sep/07 05:27 PM
Minor nit: can you add the test case to the patch as well?

Otis Gospodnetic added a comment - 17/Feb/08 08:22 AM
TUSUR OpenTeam: would it be possible to get a unit test, too?

Dmitry Lihachev added a comment - 19/Feb/08 04:04 AM
Patch that adds a new test to TestRussianAnalyzer
Usage: patch TestRussianAnalyzer.java < TestRussianAnalyzer.java.patch