Lucene - Core / LUCENE-1003

[PATCH] RussianAnalyzer's tokenizer skips numbers from input text

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      RussianAnalyzer's tokenizer skips numbers in the input text, so the resulting token stream is missing them. The problem can be solved by adding the digits to RussianCharsets.UnicodeRussian. See the test case below for details.

      TestRussianAnalyzer.java
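      import java.io.IOException;
      import java.io.Reader;
      import java.io.StringReader;

      import junit.framework.TestCase;

      import org.apache.lucene.analysis.TokenStream;
      import org.apache.lucene.analysis.ru.RussianAnalyzer;
      import org.apache.lucene.analysis.ru.RussianCharsets;

      public class TestRussianAnalyzer extends TestCase {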
      
        Reader reader = new StringReader("text 1000");
      
        // test FAILS
        public void testStemmer() {
          testAnalyzer(new RussianAnalyzer());
        }
      
        // test PASSES
        public void testFixedRussianAnalyzer() {
          testAnalyzer(new RussianAnalyzer(getRussianCharSet()));
        }
      
        private void testAnalyzer(RussianAnalyzer analyzer) {
          try {
            TokenStream stream = analyzer.tokenStream("text", reader);
            assertEquals("text", stream.next().termText());
            assertNotNull(stream.next());
          } catch (IOException e) {
            fail(e.getMessage());
          }
        }
      
        private char[] getRussianCharSet() {
          int length = RussianCharsets.UnicodeRussian.length;
          final char[] russianChars = new char[length + 10];
      
          System.arraycopy(RussianCharsets.UnicodeRussian, 0, russianChars, 0, length);
          russianChars[length++] = '0';
          russianChars[length++] = '1';
          russianChars[length++] = '2';
          russianChars[length++] = '3';
          russianChars[length++] = '4';
          russianChars[length++] = '5';
          russianChars[length++] = '6';
          russianChars[length++] = '7';
          russianChars[length++] = '8';
          russianChars[length] = '9';
          return russianChars;
        }
      }
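
The effect of the proposed fix can be sketched without Lucene on the classpath. The class below is a hypothetical stand-in, not Lucene's tokenizer: it assumes a character belongs to a token if it is a letter or appears in the supplied charset, and `BASE_CHARSET` is an invented three-letter substitute for RussianCharsets.UnicodeRussian. Under those assumptions, digits are dropped until they are appended to the charset.

```java
import java.util.ArrayList;
import java.util.List;

public class CharsetTokenizerSketch {

    // Hypothetical stand-in for RussianCharsets.UnicodeRussian: three Cyrillic letters, no digits.
    static final char[] BASE_CHARSET = {'\u0430', '\u0431', '\u0432'};

    // Returns a copy of base with the ASCII digits '0'..'9' appended.
    static char[] withDigits(char[] base) {
        char[] extended = new char[base.length + 10];
        System.arraycopy(base, 0, extended, 0, base.length);
        for (int i = 0; i < 10; i++) {
            extended[base.length + i] = (char) ('0' + i);
        }
        return extended;
    }

    // Assumed rule: a character is part of a token if it is a letter or is listed in charset.
    static boolean isTokenChar(char c, char[] charset) {
        if (Character.isLetter(c)) {
            return true;
        }
        for (char allowed : charset) {
            if (c == allowed) {
                return true;
            }
        }
        return false;
    }

    // Splits input into maximal runs of token characters; everything else is skipped.
    static List<String> tokenize(String input, char[] charset) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (isTokenChar(c, charset)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("text 1000", BASE_CHARSET));             // prints [text]
        System.out.println(tokenize("text 1000", withDigits(BASE_CHARSET))); // prints [text, 1000]
    }
}
```

The loop in `withDigits` does the same job as the ten explicit assignments in getRussianCharSet() above; either form should work, the loop is just shorter.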
      
      
      1. TestRussianAnalyzer.java.patch (0.7 kB, Dmitry Lihachev)
      2. RussianCharsets.java.patch (1 kB, OpenTeam.ru)

        Activity

        Nick Menere added a comment -

        Yeah,
        I raised this on the dev list a few months ago and didn't get much response.

        I think I might even be responsible for that code above. It was meant more as a hack to get a customer up and running.

        Cheers,
        Nick

        OpenTeam.ru added a comment -

        Yeah, Nick, the code above was taken from your JIRA issue. We weren't able to find a similar issue in the Lucene issue tracker. We're using Lucene a lot, so we needed this bug fixed in the core.

        OpenTeam.ru added a comment -

        Patch that adds numbers to RussianCharsets.
        usage: patch RussianCharsets.java < RussianCharsets.java.patch

        Grant Ingersoll added a comment -

        Minor nit: can you add the test case to the patch as well?

        Otis Gospodnetic added a comment -

        TUSUR OpenTeam: would it be possible to get a unit test, too?

        Dmitry Lihachev added a comment -

        Patch that adds a new test to TestRussianAnalyzer.
        usage:
        patch TestRussianAnalyzer.java < TestRussianAnalyzer.java.patch


          People

          • Assignee: Otis Gospodnetic
          • Reporter: OpenTeam.ru
          • Votes: 1
          • Watchers: 0
