Lucene - Core / LUCENE-1003

[PATCH] RussianAnalyzer's tokenizer skips numbers from input text

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      RussianAnalyzer's tokenizer skips numbers in the input text, so the resulting token stream is missing them. The problem can be solved by adding the digits to RussianCharsets.UnicodeRussian. See the test case below for details.

      TestRussianAnalyzer.java
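      import java.io.IOException;
      import java.io.Reader;
      import java.io.StringReader;

      import junit.framework.TestCase;

      import org.apache.lucene.analysis.TokenStream;
      import org.apache.lucene.analysis.ru.RussianAnalyzer;
      import org.apache.lucene.analysis.ru.RussianCharsets;

      public class TestRussianAnalyzer extends TestCase {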
      
        Reader reader = new StringReader("text 1000");
      
        // test FAILS
        public void testStemmer() {
          testAnalyzer(new RussianAnalyzer());
        }
      
        // test PASSES
        public void testFixedRussianAnalyzer() {
          testAnalyzer(new RussianAnalyzer(getRussianCharSet()));
        }
      
        private void testAnalyzer(RussianAnalyzer analyzer) {
          try {
            TokenStream stream = analyzer.tokenStream("text", reader);
            assertEquals("text", stream.next().termText());
            assertNotNull(stream.next());
          } catch (IOException e) {
            fail(e.getMessage());
          }
        }
      
        // Copies the stock charset and appends the digits '0'..'9'.
        private char[] getRussianCharSet() {
          int length = RussianCharsets.UnicodeRussian.length;
          final char[] russianChars = new char[length + 10];
      
          System.arraycopy(RussianCharsets.UnicodeRussian, 0, russianChars, 0, length);
          russianChars[length++] = '0';
          russianChars[length++] = '1';
          russianChars[length++] = '2';
          russianChars[length++] = '3';
          russianChars[length++] = '4';
          russianChars[length++] = '5';
          russianChars[length++] = '6';
          russianChars[length++] = '7';
          russianChars[length++] = '8';
          russianChars[length] = '9';
          return russianChars;
        }
      }
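
The effect described above can be reproduced without Lucene. The following is a minimal sketch, not Lucene's actual tokenizer: it assumes a tokenizer that keeps a character only if it is a letter or appears in an allowed-character array. The class name CharsetTokenizerSketch, the LETTERS_ONLY stand-in array, and the helper methods are all hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is not Lucene's tokenizer.
final class CharsetTokenizerSketch {

    // Assumed stand-in for RussianCharsets.UnicodeRussian:
    // a few Cyrillic letters and no digits.
    static final char[] LETTERS_ONLY = {'\u0430', '\u0431', '\u0432'};

    // Returns a copy of base with the digits '0'..'9' appended.
    static char[] withDigits(char[] base) {
        char[] extended = Arrays.copyOf(base, base.length + 10);
        for (int i = 0; i < 10; i++) {
            extended[base.length + i] = (char) ('0' + i);
        }
        return extended;
    }

    // A character belongs to a token if it is a letter or is listed in allowed.
    static boolean isTokenChar(char c, char[] allowed) {
        if (Character.isLetter(c)) {
            return true;
        }
        for (char a : allowed) {
            if (a == c) {
                return true;
            }
        }
        return false;
    }

    // Splits text into maximal runs of token characters.
    static List<String> tokenize(String text, char[] allowed) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c, allowed)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

With the stand-in array, tokenize("text 1000", LETTERS_ONLY) yields only the token "text", while tokenize("text 1000", withDigits(LETTERS_ONLY)) also yields "1000". This mirrors the failing and passing tests above under the stated assumption.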
      
      
      1. TestRussianAnalyzer.java.patch
        0.7 kB
        Dmitry Lihachev
      2. RussianCharsets.java.patch
        1 kB
        OpenTeam.ru

        Activity

        OpenTeam.ru created issue -
        Nick Menere added a comment -

        Yeah,
        I raised this on the dev list a few months ago and didn't get much response.

        I think I might even be responsible for that code above. It was meant more as a hack to get a customer up and running.

        Cheers,
        Nick

        OpenTeam.ru added a comment -

        Yeah, Nick, the code above was taken from your JIRA issue. We weren't able to find a similar issue in the Lucene issue tracker. We use Lucene a lot, so we need this bug fixed in the core.

        OpenTeam.ru added a comment -

        Patch that adds the digits to RussianCharsets.
        Usage: patch RussianCharsets.java < RussianCharsets.java.patch

        OpenTeam.ru made changes -
        Attachment RussianCharsets.java.patch [ 12366159 ]
        OpenTeam.ru made changes -
        Summary: changed from "RussianAnalyzer's tokenizer skips numbers from input text," to "[PATCH] RussianAnalyzer's tokenizer skips numbers from input text,"
        Grant Ingersoll added a comment -

        Minor nit: can you add the test case to the patch as well?

        Otis Gospodnetic added a comment -

        TUSUR OpenTeam: would it be possible to get a unit test, too?

        Otis Gospodnetic made changes -
        Lucene Fields: changed from [New] to [New, Patch Available]
        Assignee: set to Otis Gospodnetic [ otis ]
        Dmitry Lihachev added a comment -

        Patch that adds a new test to TestRussianAnalyzer.
        Usage: patch TestRussianAnalyzer.java < TestRussianAnalyzer.java.patch

        Dmitry Lihachev made changes -
        Attachment TestRussianAnalyzer.java.patch [ 12375880 ]
        Otis Gospodnetic made changes -
        Status: changed from Open [ 1 ] to Resolved [ 5 ]
        Resolution: set to Fixed [ 1 ]
        Mark Thomas made changes -
        Workflow jira [ 12413257 ] Default workflow, editable Closed status [ 12561812 ]
        Mark Thomas made changes -
        Workflow Default workflow, editable Closed status [ 12561812 ] jira [ 12584671 ]
        Transition: Open to Resolved
        Time In Source Status: 238d 37m
        Execution Times: 1
        Last Executer: Otis Gospodnetic
        Last Execution Date: 14/May/08 06:38

          People

          • Assignee:
            Otis Gospodnetic
            Reporter:
            OpenTeam.ru
          • Votes:
            1
            Watchers:
            0
