Lucene - Core

[LUCENE-1227] NGramTokenizer to handle more than 1024 chars

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New, Patch Available

      Description

      The current NGramTokenizer can't handle a character stream longer than 1024 characters. This is too short for non-whitespace-separated languages.

      I created a patch for this issue.
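
      For illustration, a minimal sketch of the failure mode (a hedged reconstruction: it uses the attribute-based TokenStream API of later Lucene releases, and the 2000-char input and unigram settings are illustrative assumptions, not part of this report):

      import java.io.StringReader;
      import org.apache.lucene.analysis.ngram.NGramTokenizer;

      public class NGramTruncationDemo {
        public static void main(String[] args) throws Exception {
          // Build a 2000-char input, well past the 1024-char read buffer.
          StringBuilder sb = new StringBuilder();
          for (int i = 0; i < 2000; i++) sb.append('x');

          // Unigrams only, so the token count should equal the input length.
          NGramTokenizer tok = new NGramTokenizer(new StringReader(sb.toString()), 1, 1);
          tok.reset();
          int count = 0;
          while (tok.incrementToken()) count++;
          tok.end();
          tok.close();

          // Unpatched, this prints 1024 rather than 2000: everything past
          // the first 1024 characters of the stream is silently dropped.
          System.out.println(count);
        }
      }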

      Attachments

      1. NGramTokenizer.patch (3 kB, Hiroaki Kawai)
      2. NGramTokenizer.patch (3 kB, Hiroaki Kawai)
      3. LUCENE-1227.patch (12 kB, Hiroaki Kawai)

          Activity

          Hiroaki Kawai added a comment -

          LUCENE-1227's NGramTokenizer.patch will also fix https://issues.apache.org/jira/browse/LUCENE-1225

          Hiroaki Kawai added a comment -

          Bugfix: I made a mistake in the char array addressing.

          Grant Ingersoll added a comment -

          Hi Hiroaki,

          Thanks for the patch. Can you add unit tests for your patch?

          Hiroaki Kawai added a comment -

          NGramTokenizerTest#testIndexAndQueryLong will test this issue.

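          For readers without the patch applied, a hedged sketch of what a long-input test along those lines might assert (JUnit 3 style; the class name, input, and gram sizes are illustrative assumptions, and the actual testIndexAndQueryLong lives in the attached patch):

          import java.io.StringReader;
          import junit.framework.TestCase;
          import org.apache.lucene.analysis.ngram.NGramTokenizer;

          public class NGramLongInputTest extends TestCase {
            public void testLongInput() throws Exception {
              // 4096 chars: four times the old 1024-char buffer.
              StringBuilder sb = new StringBuilder();
              for (int i = 0; i < 4096; i++) sb.append((char) ('a' + i % 26));

              NGramTokenizer tok = new NGramTokenizer(new StringReader(sb.toString()), 2, 2);
              tok.reset();
              int count = 0;
              while (tok.incrementToken()) count++;
              tok.end();
              tok.close();

              // A fixed tokenizer emits every bigram of the full input;
              // the unpatched one stops after the first 1024 chars.
              assertEquals(4095, count);
            }
          }
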
          Otis Gospodnetic added a comment -

          Thanks for the test and for addressing this!

          Could you add some examples for NO_OPTIMIZE and QUERY_OPTIMIZE? I can't tell from looking at the patch what those are about. Also, note how existing variables use camelCaseLikeThis. It would be good to stick to the same pattern (instead of bufflen, buffpos, etc.), as well as to the existing style (e.g. a space between "if" and the open paren, spaces around "==" and "=", etc.).

          I'll commit as soon as you make these changes, assuming you can make them. Thank you.

          Michael Dodsworth added a comment -

          Any progress on getting this patch into a release? I can take a look if nobody else is.

          Grant Ingersoll added a comment -

          Yes, please do have a look and let us know what you think.

          Torao Takami added a comment - (edited)

          I worked around this problem by combining WhitespaceTokenizerFactory with NGramFilterFactory in the Solr schema:

          <tokenizer class="solr.NGramTokenizerFactory" maxGramSize="2" minGramSize="2"/>

          is rewritten as:

          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="2"/>

          Platform: Solr 1.4.1

          Raimund Merkert added a comment -

          For me, this also works (for my purposes, at least):

          String str = <read contents of reader to string>

          TokenStream tokens = new KeywordTokenizer(new StringReader(str.trim()));
          tokens = new NGramTokenFilter(tokens, minNGram, maxNGram);

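          A hedged sketch of driving that workaround end to end (package locations follow the Lucene 3.x layout, and the 3000-char input and bigram sizes are illustrative assumptions):

          import java.io.StringReader;
          import org.apache.lucene.analysis.KeywordTokenizer;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.ngram.NGramTokenFilter;

          public class KeywordNGramDemo {
            public static void main(String[] args) throws Exception {
              // Input longer than NGramTokenizer's 1024-char limit.
              StringBuilder sb = new StringBuilder();
              for (int i = 0; i < 3000; i++) sb.append((char) ('a' + i % 26));
              String str = sb.toString();

              // Emit the whole input as a single keyword token, then
              // split that token into bigrams with NGramTokenFilter.
              TokenStream tokens = new KeywordTokenizer(new StringReader(str.trim()));
              tokens = new NGramTokenFilter(tokens, 2, 2);

              tokens.reset();
              int count = 0;
              while (tokens.incrementToken()) count++;
              tokens.end();
              tokens.close();
              System.out.println(count); // 2999: bigrams over the full 3000 chars
            }
          }
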
          Harald Wellmann added a comment -

          As long as this issue is not fixed, please mention the 1024-character truncation in the Javadoc.

          The combination of KeywordTokenizer and NGramTokenFilter does not scale well for large inputs, as KeywordTokenizer reads the entire input stream into a character buffer.

          David Byrne added a comment -

          I proposed a patch that fixes this problem on another ticket: https://issues.apache.org/jira/browse/LUCENE-2947

          Adrien Grand added a comment -

          LUCENE-4955 fixed this issue.

          Adrien Grand added a comment -

          David, sorry, I didn't know about your patch and happened to fix this issue as part of LUCENE-4955. Your patch seems to operate very similarly and adds support for whitespace collapsing, is that correct? Don't hesitate to tell me if you think the current implementation needs improvement.

            People

            • Assignee: Unassigned
            • Reporter: Hiroaki Kawai
            • Votes: 3
            • Watchers: 7
