Lucene - Core
LUCENE-1227

NGramTokenizer to handle more than 1024 chars

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      The current NGramTokenizer can't handle a character stream longer than 1024 characters. This is too short for languages that are not whitespace-separated.

      I created a patch for this issue.

      Attachments

      1. LUCENE-1227.patch
        12 kB
        Hiroaki Kawai
      2. NGramTokenizer.patch
        3 kB
        Hiroaki Kawai
      3. NGramTokenizer.patch
        3 kB
        Hiroaki Kawai

        Issue Links

        This issue incorporates LUCENE-1225

          Activity

          Hiroaki Kawai created issue -
          Hiroaki Kawai made changes -
          Attachment: NGramTokenizer.patch [ 12377756 ]
          Hiroaki Kawai added a comment -

          LUCENE-1227's NGramTokenizer.patch will also fix https://issues.apache.org/jira/browse/LUCENE-1225

          Grant Ingersoll made changes -
          Link: This issue incorporates LUCENE-1225 [ LUCENE-1225 ]
          Grant Ingersoll made changes -
          Assignee: Grant Ingersoll [ gsingers ]
          Grant Ingersoll made changes -
          Priority: Major [ 3 ] → Minor [ 4 ]
          Lucene Fields: [Patch Available, New] → [New, Patch Available]
          Hiroaki Kawai added a comment -

          Bug fix: I had made a mistake in the char array addressing.

          Hiroaki Kawai made changes -
          Attachment: NGramTokenizer.patch [ 12377794 ]
          Grant Ingersoll added a comment -

          Hi Hiroaki,

          Thanks for the patch. Can you add unit tests for your patch?

          Hiroaki Kawai added a comment -

          NGramTokenizerTest#testIndexAndQueryLong tests this issue.

          Hiroaki Kawai made changes -
          Attachment: LUCENE-1227.patch [ 12377998 ]
          Otis Gospodnetic added a comment -

          Thanks for the test and for addressing this!

          Could you add some examples for NO_OPTIMIZE and QUERY_OPTIMIZE? I can't tell from looking at the patch what those are about. Also, note that existing variables use camelCaseLikeThis; it would be good to stick to the same pattern (instead of bufflen, buffpos, etc.), as well as to the existing style (e.g. a space between if and the opening paren, spaces around == and =, etc.).

          I'll commit as soon as you make these changes, assuming you can make them. Thank you.

          Michael Dodsworth added a comment -

          Any progress on getting this patch into a release? I can take a look if nobody else is.

          Grant Ingersoll added a comment -

          Yes, please do have a look and let us know what you think.

          Robert Muir made changes -
          Component/s: contrib/analyzers [ 12312333 ]
          Component/s: contrib/* [ 12312028 ]
          Mark Thomas made changes -
          Workflow: jira [ 12425956 ] → Default workflow, editable Closed status [ 12563499 ]
          Mark Thomas made changes -
          Workflow: Default workflow, editable Closed status [ 12563499 ] → jira [ 12585095 ]
          Shai Erera made changes -
          Component/s: modules/analysis [ 12310230 ]
          Component/s: contrib/analyzers [ 12312333 ]
          Torao Takami added a comment (edited) -

          I worked around this problem by combining NGramFilterFactory with WhitespaceTokenizerFactory in the Solr schema.

          <tokenizer class="solr.NGramTokenizerFactory" maxGramSize="2" minGramSize="2" />

          can be rewritten as:

          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="2"/>

          Platform: Solr 1.4.1

          Grant Ingersoll made changes -
          Assignee: Grant Ingersoll [ gsingers ]
          Raimund Merkert added a comment -

          For me, this also works (for my purposes, at least):

          String str = <read contents of reader to string>

          TokenStream tokens = new KeywordTokenizer(new StringReader(str.trim()));
          tokens = new NGramTokenFilter(tokens, minNGram, maxNGram);

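          For reference, a self-contained version of this workaround might look as follows (a sketch assuming Lucene 3.x-era packages and the KeywordTokenizer(Reader) and NGramTokenFilter(TokenStream, int, int) constructors; the readFully helper and the gram sizes are illustrative):

          import java.io.Reader;
          import java.io.StringReader;

          import org.apache.lucene.analysis.KeywordTokenizer;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.ngram.NGramTokenFilter;
          import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

          public class NGramWorkaroundSketch {

            // Illustrative helper: drain the whole Reader into a String, so the
            // single "keyword" token carries the complete input.
            static String readFully(Reader reader) throws Exception {
              StringBuilder sb = new StringBuilder();
              char[] buf = new char[8192];
              int n;
              while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
              }
              return sb.toString();
            }

            // Emit n-grams over the full input by n-gramming the single keyword token.
            static void printNGrams(Reader reader, int minNGram, int maxNGram) throws Exception {
              String str = readFully(reader);
              TokenStream tokens = new KeywordTokenizer(new StringReader(str.trim()));
              tokens = new NGramTokenFilter(tokens, minNGram, maxNGram);
              CharTermAttribute term = tokens.addAttribute(CharTermAttribute.class);
              tokens.reset();
              while (tokens.incrementToken()) {
                System.out.println(term.toString());
              }
              tokens.end();
              tokens.close();
            }
          }

          Note that this buffers the entire input in memory, which is the scaling concern raised in the following comment.
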
          Harald Wellmann added a comment -

          As long as this issue is not fixed, please mention the 1024 character truncation in the Javadoc.

          The combination of KeywordTokenizer and NGramTokenFilter does not scale well for large inputs, as KeywordTokenizer reads the entire input stream into a character buffer.

          David Byrne added a comment -

          I proposed a patch that fixes this problem on another ticket:

          https://issues.apache.org/jira/browse/LUCENE-2947

          Adrien Grand added a comment -

          LUCENE-4955 fixed this issue.

          Adrien Grand made changes -
          Status: Open [ 1 ] → Resolved [ 5 ]
          Resolution: Fixed [ 1 ]
          Adrien Grand added a comment -

          David, sorry, I didn't know about your patch and happened to fix this issue as part of LUCENE-4955. Your patch seems to operate very similarly and adds support for whitespace collapsing; is that correct? Don't hesitate to tell me if you think the current implementation needs improvements.


            People

            • Assignee: Unassigned
            • Reporter: Hiroaki Kawai
            • Votes: 3
            • Watchers: 7

              Dates

              • Created:
              • Updated:
              • Resolved:
