Key: LUCENE-1003
Type: Bug
Status: Resolved
Resolution: Fixed
Priority: Major
Assignee: Otis Gospodnetic
Reporter: OpenTeam.ru
Votes: 1
Watchers: 0
Lucene - Java

[PATCH] RussianAnalyzer's tokenizer skips numbers from input text,

Created: 19/Sep/07 05:00 AM   Updated: 14/May/08 05:38 AM
Component/s: Analysis
Affects Version/s: 2.2
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
RussianCharsets.java.patch (text file, licensed for inclusion in ASF works), 2007-09-19 07:54 AM, OpenTeam.ru, 1 kB
TestRussianAnalyzer.java.patch (text file, licensed for inclusion in ASF works), 2008-02-19 04:04 AM, Dmitry Lihachev, 0.7 kB

Lucene Fields: Patch Available, New
Resolution Date: 14/May/08 05:38 AM


Description
RussianAnalyzer's tokenizer skips numbers in the input text, so the resulting token stream is missing them. The problem can be solved by adding the digits to RussianCharsets.UnicodeRussian. See the test case below for details.
TestRussianAnalyzer.java
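import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import junit.framework.TestCase;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ru.RussianAnalyzer;
import org.apache.lucene.analysis.ru.RussianCharsets;

public class TestRussianAnalyzer extends TestCase {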

  Reader reader = new StringReader("text 1000");

  // test FAILS
  public void testStemmer() {
    testAnalyzer(new RussianAnalyzer());
  }

  // test PASSES
  public void testFixedRussianAnalyzer() {
    testAnalyzer(new RussianAnalyzer(getRussianCharSet()));
  }

  private void testAnalyzer(RussianAnalyzer analyzer) {
    try {
      TokenStream stream = analyzer.tokenStream("text", reader);
      assertEquals("text", stream.next().termText());
      assertNotNull(stream.next());
    } catch (IOException e) {
      fail(e.getMessage());
    }
  }

  private char[] getRussianCharSet() {
    int length = RussianCharsets.UnicodeRussian.length;
    final char[] russianChars = new char[length + 10];

    System.arraycopy(RussianCharsets.UnicodeRussian, 0, russianChars, 0, length);
    russianChars[length++] = '0';
    russianChars[length++] = '1';
    russianChars[length++] = '2';
    russianChars[length++] = '3';
    russianChars[length++] = '4';
    russianChars[length++] = '5';
    russianChars[length++] = '6';
    russianChars[length++] = '7';
    russianChars[length++] = '8';
    russianChars[length] = '9';
    return russianChars;
  }
}
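
To illustrate why an allow-list tokenizer drops "1000" and why extending the character table fixes it, here is a minimal sketch that does not use Lucene at all. CharsetTokenizerSketch, isTokenChar and tokenize are hypothetical names for a simplified stand-in, not Lucene's actual tokenizer code, and the Latin alphabet is used only to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for a charset-driven tokenizer; not Lucene's implementation. */
public class CharsetTokenizerSketch {

  /** A character can be part of a token only if it appears in the allowed table. */
  static boolean isTokenChar(char c, char[] allowed) {
    for (char a : allowed) {
      if (a == c) {
        return true;
      }
    }
    return false;
  }

  /** Splits the input into maximal runs of allowed characters; everything else is a separator. */
  static List<String> tokenize(String input, char[] allowed) {
    List<String> tokens = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (char c : input.toCharArray()) {
      if (isTokenChar(c, allowed)) {
        current.append(c);
      } else if (current.length() > 0) {
        // A disallowed character ends the current token.
        tokens.add(current.toString());
        current.setLength(0);
      }
    }
    if (current.length() > 0) {
      tokens.add(current.toString());
    }
    return tokens;
  }

  public static void main(String[] args) {
    char[] lettersOnly = "abcdefghijklmnopqrstuvwxyz".toCharArray();
    char[] lettersAndDigits = "abcdefghijklmnopqrstuvwxyz0123456789".toCharArray();
    System.out.println(tokenize("text 1000", lettersOnly));      // [text]
    System.out.println(tokenize("text 1000", lettersAndDigits)); // [text, 1000]
  }
}
```

With the letters-only table the digits are treated as separators, so only one token comes out. Once the digits are in the table, "1000" survives as a second token, which is what the assertNotNull check in the test case above expects.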


OpenTeam.ru made changes - 19/Sep/07 05:02 AM
Description: added the "// test FAILS" and "// test PASSES" comments to the test case; the text is otherwise identical to the description above.
Nick Menere added a comment - 19/Sep/07 05:19 AM
Yeah,
I raised this on the dev list a few months ago and didn't get much response.

I think I might even be responsible for that code above. It was meant more as a hack to get a customer up and running.

Cheers,
Nick


OpenTeam.ru added a comment - 19/Sep/07 07:50 AM
Yeah, Nick, the code above was taken from your JIRA issue. We weren't able to find a similar issue in the Lucene issue tracker. We use Lucene a lot, so we needed this bug fixed in the core.

OpenTeam.ru added a comment - 19/Sep/07 07:54 AM
Patch that adds numbers to RussianCharsets
usage: patch RussianCharsets.java < RussianCharsets.java.patch
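
The contents of the attached patch are not reproduced in this thread. As a rough illustration of the idea only, a more compact equivalent of the getRussianCharSet helper from the description could append the ten ASCII digits with a loop. DigitCharsetSketch and withDigits are hypothetical names, not part of the patch or of Lucene.

```java
import java.util.Arrays;

/** Hypothetical helper: returns a copy of a character table with '0'..'9' appended. */
public class DigitCharsetSketch {

  static char[] withDigits(char[] base) {
    // Copy the original table into an array with room for ten more entries.
    char[] extended = Arrays.copyOf(base, base.length + 10);
    for (int i = 0; i < 10; i++) {
      extended[base.length + i] = (char) ('0' + i);
    }
    return extended;
  }

  public static void main(String[] args) {
    // A two-letter sample table stands in for RussianCharsets.UnicodeRussian.
    char[] sample = {'а', 'б'};
    System.out.println(new String(withDigits(sample)));
  }
}
```

The input array is left untouched, so a caller could pass the result to the RussianAnalyzer(char[]) constructor used in the test case without modifying the shared table.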

OpenTeam.ru made changes - 19/Sep/07 07:54 AM
Attachment RussianCharsets.java.patch [ 12366159 ]
OpenTeam.ru made changes - 19/Sep/07 07:57 AM
Summary: "RussianAnalyzer's tokenizer skips numbers from input text," -> "[PATCH] RussianAnalyzer's tokenizer skips numbers from input text,"
Grant Ingersoll added a comment - 19/Sep/07 05:27 PM
Minor nit: can you add the test case to the patch as well?

Otis Gospodnetic added a comment - 17/Feb/08 08:22 AM
TUSUR OpenTeam: would it be possible to get a unit test, too?

Otis Gospodnetic made changes - 17/Feb/08 08:22 AM
Lucene Fields: [New] -> [New, Patch Available]
Assignee: (none) -> Otis Gospodnetic [ otis ]
Dmitry Lihachev added a comment - 19/Feb/08 04:04 AM
Patch that adds a new test to TestRussianAnalyzer
usage:
patch TestRussianAnalyzer.java < TestRussianAnalyzer.java.patch

Dmitry Lihachev made changes - 19/Feb/08 04:04 AM
Attachment TestRussianAnalyzer.java.patch [ 12375880 ]
Subversion commit:
Repository: ASF
Revision: #656111
Date: Wed May 14 05:37:45 UTC 2008
User: otis
Message: LUCENE-1003: Don't let RussianAnalyzer drop numbers.
Files changed:
MODIFY /lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/ru/RussianCharsets.java
MODIFY /lucene/java/trunk/contrib/analyzers/src/test/org/apache/lucene/analysis/ru/TestRussianAnalyzer.java
MODIFY /lucene/java/trunk/CHANGES.txt

Otis Gospodnetic made changes - 14/May/08 05:38 AM
Status: Open [ 1 ] -> Resolved [ 5 ]
Resolution: (none) -> Fixed [ 1 ]