Lucene - Core / LUCENE-759

Add n-gram tokenizers to contrib/analyzers

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      It would be nice to have some n-gram-capable tokenizers in contrib/analyzers. Patch coming shortly.

      1. LUCENE-759.patch
        11 kB
        Otis Gospodnetic
      2. LUCENE-759.patch
        12 kB
        Otis Gospodnetic
      3. LUCENE-759.patch
        15 kB
        Otis Gospodnetic
      4. LUCENE-759-filters.patch
        19 kB
        Otis Gospodnetic


          Activity

          Otis Gospodnetic added a comment -

          This should have been marked Fixed a while back.

          Daniel Naber added a comment -

          Can this issue be closed or is there anything still open?

          Otis Gospodnetic added a comment -

          Oh, look at that!

          [otis@localhost contrib]$ svn st
          A analyzers/src/java/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.java
          A analyzers/src/java/org/apache/lucene/analysis/ngram/NGramTokenFilter.java


          It's in the repo now. Sorry about that!

          Hoss Man added a comment -

          Ack! ... I'm sorry, I completely misread Patrick's question.

          N-gram Tokenizers have been committed, but there are no n-gram TokenFilters. There are tests for TokenFilters that Otis committed on March 2, but those tests don't currently compile/run without the TokenFilters themselves.

          Otis: do you have these TokenFilters in your working directory that you perhaps forgot to svn add before committing?

          Hoss Man added a comment -

          Otis's most recent attachment contains only tests, but previous attachments had implementations,

          all of which have been committed under contrib/analyzers.

          (tip: if you click "All" at the top of the list of comments in Jira, you see every modification related to this issue, including subversion commits that Jira detects are related to the issue based on the commit message)

          Patrick Turcotte added a comment -

          Is it just me, or are the NGramTokenFilter and EdgeNGramTokenFilter classes not committed to SVN and not in the patch either?

          NGramTokenFilterTest and EdgeNGramTokenFilterTest refer to them, but I cannot seem to find them.

          Thanks, and keep up the good work.

          Patrick

          Otis Gospodnetic added a comment -

          Ah, didn't see your comments here earlier, Doron. Yes, I think you are correct about the 1024 limit - when I wrote that Tokenizer I was thinking TokenFilter, and thus I was thinking that the input Reader represents a Token, which was wrong. So, I thought, "oh, 1024 chars/token, that will be enough". I ended up needing TokenFilters for SOLR-81, so that's what I checked in. Those operate on tokens and don't have the 1024 limitation.

          Anyhow, feel free to slap your test + the fix in and thanks for checking!
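
          For illustration, a hypothetical usage sketch of the committed NGramTokenFilter (assuming the (TokenStream, minGram, maxGram) constructor implied above and the old Token/termText() API used in the tests elsewhere in this issue; treat the exact signatures as assumptions, not the committed API):

            // Each whitespace-split token, rather than the whole Reader, is what gets
            // broken into n-grams, so the 1024-character Reader limit does not apply.
            Reader reader = new StringReader("lucene analyzers");
            TokenStream ngrams = new NGramTokenFilter(new WhitespaceTokenizer(reader), 2, 3);
            Token token;
            while ((token = ngrams.next()) != null) {
              System.out.println(token.termText());   // "lu", "uc", ..., "luc", "uce", ...
            }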

          Doron Cohen added a comment -

          Hi Otis,

          > (and I think, as a result, the input stream would remain dangling open)

          I take this part back - closing tokenStream would close the reader, and at least for the case that I thought of, invertDocument, the tokenStream is properly closed.

          Can you comment on the input length: is it correct to handle only the first 1024 characters?

          Thanks,
          Doron

          Otis Gospodnetic added a comment -

          N-gram-producing TokenFilters for Karl's mom.

          Otis Gospodnetic added a comment -

          More goodies coming.

          Doron Cohen added a comment -

          I have two comments/questions on the n-gram tokenizers:

          (1) Seems that only the first 1024 characters of the input are handled, and the rest is ignored (and I think, as a result, the input stream would remain dangling open).

          If you add this test case:

          /**
           * Test that no ngrams are lost, even for really long inputs
           * @throws Exception
           */
          public void testLongerInput() throws Exception {
            int expectedNumTokens = 1024;
            int ngramLength = 2;
            // prepare long string
            StringBuffer sb = new StringBuffer();
            while (sb.length() < expectedNumTokens + ngramLength - 1)
              sb.append('a');

            StringReader longStringReader = new StringReader(sb.toString());
            NGramTokenizer tokenizer = new NGramTokenizer(longStringReader, ngramLength, ngramLength);
            int numTokens = 0;
            Token token;
            while ((token = tokenizer.next()) != null) {
              numTokens++;
              assertEquals("aa", token.termText());
            }
            assertEquals("wrong number of tokens", expectedNumTokens, numTokens);
          }

          With expectedNumTokens = 1023 it would pass, but any larger number would fail.

          (2) It seems safer to read the characters like this:

            int n = input.read(chars);
            inStr = new String(chars, 0, n);

          (This way we are not relying on String.trim(), which does work, but worries me.)
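
          As a concrete illustration of that suggestion, a minimal sketch (hypothetical, not the committed code) assuming the tokenizer reads from its Reader field input into a fixed 1024-char buffer, as described in (1): use the count returned by read() instead of trimming the unused part of the buffer.

            char[] chars = new char[1024];
            int n = input.read(chars);                        // read() may fill only part of the buffer
            String inStr = (n > 0) ? new String(chars, 0, n)  // keep only the characters actually read
                                   : "";                      // nothing read (end of stream)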

          Otis Gospodnetic added a comment -

          In SVN.

          Otis Gospodnetic added a comment -

          Here is the proper version. This one is essentially Adam's patch from SOLR-81, adapted to the Lucene n-gram analyzers, plus some passing unit tests I wrote to exercise the new n-gram range functionality.

          I'll commit this by the end of the week unless Adam spots a bug.

          Otis Gospodnetic added a comment -

          Damn, I think you are right! Once again, I'm making late-night mistakes. When will I learn!?
          But I could move my code to NGramTokenizer then, at least.
          My bug remains, though... got an idea for a fix?

          Adam Hiatt added a comment -

          Otis: this really isn't a bug. The min/max gram code I added only applied to the EdgeNGramTokenizer. I only want to generate edge n-grams between the range of sizes provided.

          For example, with the EdgeNGramTokenizer
          input: abcde
          minGram: 1
          maxGram: 3

          'a ab abc' is in fact what I intended to produce.

          I think it makes more sense for the functionality to which you referred to be located in NGramTokenizer.
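
          For reference, a small sketch of that example (assuming an EdgeNGramTokenizer(Reader, minGram, maxGram) constructor analogous to the NGramTokenizer used in the tests in this issue; the committed class may differ in detail):

            Reader reader = new StringReader("abcde");
            EdgeNGramTokenizer tokenizer = new EdgeNGramTokenizer(reader, 1, 3);  // assumed (input, minGram, maxGram)
            Token token;
            while ((token = tokenizer.next()) != null) {
              System.out.print(token.termText() + " ");   // prints: a ab abc
            }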

          Otis Gospodnetic added a comment -

          The modified tokenizer and the extended unit test.

          Otis Gospodnetic added a comment -

          Reopening, because I'm bringing in Adam Hiatt's modifications that he uploaded in a patch for SOLR-81. Adam's changes allow this tokenizer to create n-grams whose sizes are specified as a min-max range.

          This patch fixes a bug in Adam's code, but has another bug that I don't know how to fix now.
          Adam's bug:
          input: abcde
          minGram: 1
          maxGram: 3
          output: a ab abc (tokenizing stopped here, which was wrong; it should have continued: b bc bcd c cd cde d de e)

          Otis' bug:
          input: abcde
          minGram: 1
          maxGram: 3
          output: e de cde d cd bcd c bc abc b ab (tokenizing stops here, which is wrong; it should generate one more n-gram: a)

          This bug won't hurt SOLR-81, but it should be fixed.
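
          For reference, a plain-Java sketch (independent of the Lucene classes) that enumerates every n-gram of "abcde" for sizes 1 through 3, i.e. the complete output the tokenizer should produce once both bugs are fixed:

            String input = "abcde";
            int minGram = 1, maxGram = 3;
            StringBuilder out = new StringBuilder();
            for (int start = 0; start < input.length(); start++) {
              for (int len = minGram; len <= maxGram && start + len <= input.length(); len++) {
                out.append(input.substring(start, start + len)).append(' ');
              }
            }
            System.out.println(out.toString().trim());
            // prints: a ab abc b bc bcd c cd cde d de e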

          Otis Gospodnetic added a comment -

          Unit tests pass, committed.

          Otis Gospodnetic added a comment -

          Included:
          NGramTokenizer
          NGramTokenizerTest
          EdgeNGramTokenizer
          EdgeNGramTokenizerTest


            People

            • Assignee: Otis Gospodnetic
            • Reporter: Otis Gospodnetic
            • Votes: 0
            • Watchers: 0
