Lucene - Core / LUCENE-400

NGramFilter -- construct n-grams from a TokenStream

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.4
    • Component/s: modules/analysis
    • Labels:
      None
    • Environment:

      Operating System: All
      Platform: All

      Description

      This filter constructs n-grams (token combinations up to a fixed size, sometimes
      called "shingles") from a token stream.

      The filter sets start offsets, end offsets and position increments, so
      highlighting and phrase queries should work.
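
      As a rough illustration of the idea (not the attached filter itself), the
      sketch below builds word-level n-grams from a whitespace-split sentence and
      gives each one the start offset of its first token and the end offset of its
      last token. The class ShingleSketch, its Tok type and both methods are
      hypothetical names; emitting only combinations of two or more tokens is an
      assumption, and position increments are not modelled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, Lucene-free sketch of word-level n-gram ("shingle") construction. */
class ShingleSketch {

    /** A token with its character offsets in the original text. */
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "[" + start + "," + end + ")";
        }
    }

    /** Splits on single spaces and records offsets; stands in for a real tokenizer. */
    static List<Tok> tokenize(String s) {
        List<Tok> out = new ArrayList<>();
        int pos = 0;
        for (String w : s.split(" ")) {
            out.add(new Tok(w, pos, pos + w.length()));
            pos += w.length() + 1; // skip the separating space
        }
        return out;
    }

    /** Emits every run of 2..maxSize adjacent tokens, joined with a space. */
    static List<Tok> shingles(List<Tok> tokens, int maxSize) {
        List<Tok> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            StringBuilder sb = new StringBuilder(tokens.get(i).text);
            for (int n = 2; n <= maxSize && i + n - 1 < tokens.size(); n++) {
                Tok last = tokens.get(i + n - 1);
                sb.append(' ').append(last.text);
                // start offset of the first token, end offset of the last token
                out.add(new Tok(sb.toString(), tokens.get(i).start, last.end));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (Tok t : shingles(tokenize("please divide this sentence into ngrams"), 2)) {
            System.out.println(t);
        }
    }
}
```

      With a maximum size of 2 this yields five two-word n-grams, from
      "please divide" through "into ngrams".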

      Position increments > 1 in the input stream are replaced by filler tokens
      (tokens with termText "_" and endOffset - startOffset = 0) in the output
      n-grams. (Position increments > 1 in the input stream are usually caused by
      removing some tokens, e.g. stopwords, from a stream.)
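
      The filler behaviour can be pictured with the following sketch. It assumes
      one "_" filler per skipped position, which is an interpretation of the text
      above rather than something verified against the attached code; the class
      FillerSketch and its methods are hypothetical, and the zero-width offsets of
      the fillers are not modelled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: expanding position gaps into "_" fillers before building n-grams. */
class FillerSketch {

    /** Minimal token: text plus position increment relative to the previous token. */
    static final class Tok {
        final String text;
        final int posInc;

        Tok(String text, int posInc) {
            this.text = text;
            this.posInc = posInc;
        }
    }

    /** Replaces each increment > 1 with (increment - 1) fillers followed by the token. */
    static List<String> expandGaps(List<Tok> tokens) {
        List<String> out = new ArrayList<>();
        for (Tok t : tokens) {
            for (int gap = 1; gap < t.posInc; gap++) {
                out.add("_"); // one filler per removed position (assumption)
            }
            out.add(t.text);
        }
        return out;
    }

    /** Joins each pair of adjacent entries into a two-word n-gram. */
    static List<String> bigrams(List<String> words) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < words.size(); i++) {
            out.add(words.get(i) + " " + words.get(i + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        // "divide the sentence" after a stop filter has removed "the"
        List<Tok> in = new ArrayList<>();
        in.add(new Tok("divide", 1));
        in.add(new Tok("sentence", 2)); // increment 2 marks the removed stopword
        System.out.println(bigrams(expandGaps(in))); // prints [divide _, _ sentence]
    }
}
```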

      The filter uses CircularFifoBuffer and UnboundedFifoBuffer from Apache
      Commons-Collections.
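
      For readers unfamiliar with these buffers: a circular FIFO holds a fixed
      number of elements and replaces its oldest element when full. A minimal
      JDK-only sketch of that behaviour follows; the class name BoundedFifo and
      its methods are hypothetical and are not taken from the attachments.

```java
import java.util.LinkedList;

/** Hypothetical JDK-only stand-in for a fixed-capacity FIFO that evicts its oldest element. */
class BoundedFifo<T> {
    private final LinkedList<T> items = new LinkedList<>();
    private final int capacity;

    BoundedFifo(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be >= 1");
        }
        this.capacity = capacity;
    }

    /** Appends an element, dropping the oldest one if the buffer is already full. */
    void add(T item) {
        if (items.size() == capacity) {
            items.removeFirst();
        }
        items.addLast(item);
    }

    /** Removes and returns the oldest element, or null if the buffer is empty. */
    T poll() {
        return items.pollFirst();
    }

    int size() {
        return items.size();
    }

    @Override
    public String toString() {
        return items.toString();
    }

    public static void main(String[] args) {
        BoundedFifo<String> window = new BoundedFifo<>(2);
        window.add("a");
        window.add("b");
        window.add("c"); // evicts "a"
        System.out.println(window); // prints [b, c]
    }
}
```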

      Filter, test case and an analyzer are attached.

      1. LUCENE-400.patch
        26 kB
        Steve Rowe
      2. ASF.LICENSE.NOT.GRANTED--NGramAnalyzerWrapperTest.java
        5 kB
        Sebastian Kirsch
      3. ASF.LICENSE.NOT.GRANTED--NGramFilterTest.java
        6 kB
        Sebastian Kirsch
      4. ASF.LICENSE.NOT.GRANTED--NGramAnalyzerWrapper.java
        2 kB
        Sebastian Kirsch
      5. ASF.LICENSE.NOT.GRANTED--NGramFilter.java
        6 kB
        Sebastian Kirsch

        Activity

        Sebastian Kirsch added a comment -

        Created an attachment (id=15504)
        NGramFilter

        Sebastian Kirsch added a comment -

        Created an attachment (id=15505)
        NGramAnalyzerWrapper (wraps an NGramFilter around an analyzer.)

        Sebastian Kirsch added a comment -

        Created an attachment (id=15506)
        JUnit TestCase for NGramFilter

        Robert Newson added a comment -
        * <p>For example, the sentence "please divide this sentence into ngrams" would be
        * tokenized into the tokens "please divide", "this sentence", "sentence into", and
        * "into ngrams".

        The comment should read:

        * <p>For example, the sentence "please divide this sentence into ngrams" would be
        * tokenized into the tokens "please divide", "divide this", "this sentence",
        * "sentence into", and "into ngrams".

        Sebastian Kirsch added a comment -

        Created an attachment (id=15818)
        JUnit test class for NGramAnalyzerWrapper

        The tests in this class are concerned with the interaction between QueryParser
        and an NGramAnalyzer, and whether searching works as expected on an index
        constructed with an NGramAnalyzer.

        One of the test cases throws an exception that I haven't investigated yet. So
        proceed with caution if you use the QueryParser with NGramAnalyzer.

        ..E....
        Time: 1.771
        There was 1 error:
        1) testNGramAnalyzerWrapperPhraseQueryParsingFails(org.apache.lucene.analysis.NGramAnalyzerWrapperTest)java.lang.NullPointerException
            at org.apache.lucene.index.MultipleTermPositions.skipTo(MultipleTermPositions.java:178)
            at org.apache.lucene.search.PhrasePositions.skipTo(PhrasePositions.java:47)
            at org.apache.lucene.search.PhraseScorer.doNext(PhraseScorer.java:73)
            at org.apache.lucene.search.PhraseScorer.next(PhraseScorer.java:66)
            at org.apache.lucene.search.Scorer.score(Scorer.java:47)
            at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:102)
            at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:65)
            at org.apache.lucene.search.Hits.<init>(Hits.java:44)
            at org.apache.lucene.search.Searcher.search(Searcher.java:40)
            at org.apache.lucene.search.Searcher.search(Searcher.java:32)
            at org.apache.lucene.analysis.NGramAnalyzerWrapperTest.queryParsingTest(NGramAnalyzerWrapperTest.java:75)
            at org.apache.lucene.analysis.NGramAnalyzerWrapperTest.testNGramAnalyzerWrapperPhraseQueryParsingFails(NGramAnalyzerWrapperTest.java:100)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
            at org.apache.lucene.analysis.NGramAnalyzerWrapperTest.main(NGramAnalyzerWrapperTest.java:36)

        FAILURES!!!
        Tests run: 6, Failures: 0, Errors: 1

        Otis Gospodnetic added a comment -

        Sebastian, did you ever figure out the problem? Also, is there a way to get rid of the Commons Collections dependency? Lucene has no run-time dependencies on other libraries.

        Sebastian Kirsch added a comment -

        Hi Otis,

        I did not figure out the problem. Getting rid of Commons Collections should be no problem; I am just using them as FIFOs. However, I do not have the time at the moment to implement this.

        Kind regards, Sebastian

        Grant Ingersoll added a comment -

        Lucene has NGram support

        Steve Rowe added a comment -

        Lucene has character NGram support, but not word NGram support, which this filter supplies:

        This filter constructs n-grams (token combinations up to a fixed size, sometimes called "shingles") from a token stream.
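
        To make the distinction concrete, here is a small plain-Java sketch that
        contrasts the two kinds of n-gram. It does not use Lucene; the class
        NGramKinds and its methods are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of character-level versus word-level n-grams. */
class NGramKinds {

    /** Character n-grams: every substring of length n within a single word. */
    static List<String> charNGrams(String word, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    /** Word n-grams: every run of n adjacent words, joined with a space. */
    static List<String> wordNGrams(String[] words, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= words.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(charNGrams("ngrams", 2));
        // prints [ng, gr, ra, am, ms]
        System.out.println(wordNGrams(new String[] {"sentence", "into", "ngrams"}, 2));
        // prints [sentence into, into ngrams]
    }
}
```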

        Grant Ingersoll added a comment -

        Good catch, Steve. I will reopen, as a word based ngram filter is useful.

        Steve Rowe added a comment -

        Repackaged these four files as a patch, with the following modifications to the code:

        • Renamed files and variables to refer to "n-grams" as "shingles", to avoid confusion with the character-level n-gram code already in Lucene's sandbox
        • Placed code in the o.a.l.analysis.shingle package
        • Converted commons-collections FIFO usages to LinkedLists
        • Removed @author from javadocs
        • Changed deprecated Lucene API usages to alternate forms; addressed all compilation warnings
        • Changed code style to conform to Lucene conventions
        • Changed field setters to return void instead of a reference to the class instance, then changed instantiations to use individual setter calls instead of the chained calling style
        • Added ASF license to each file
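
        The setter change in the list above amounts to the difference sketched
        below: chained setters hand back the instance, while conventional setters
        return nothing and are called one at a time. The class and method names
        (SetterStyles, Chained, Plain, setMaxSize, setFiller) are hypothetical and
        are not the names used in the patch.

```java
/** Hypothetical illustration of the two setter styles; not the actual filter class. */
class SetterStyles {

    /** Chained style: each setter returns the instance so calls can be strung together. */
    static final class Chained {
        int maxSize = 2;
        String filler = "_";

        Chained setMaxSize(int maxSize) {
            this.maxSize = maxSize;
            return this;
        }

        Chained setFiller(String filler) {
            this.filler = filler;
            return this;
        }
    }

    /** Conventional style: setters return void and are called individually. */
    static final class Plain {
        int maxSize = 2;
        String filler = "_";

        void setMaxSize(int maxSize) {
            this.maxSize = maxSize;
        }

        void setFiller(String filler) {
            this.filler = filler;
        }
    }

    public static void main(String[] args) {
        // Before: configuration in a single chained expression
        Chained c = new Chained().setMaxSize(3).setFiller("#");

        // After: one statement per setter
        Plain p = new Plain();
        p.setMaxSize(3);
        p.setFiller("#");

        System.out.println(c.maxSize == p.maxSize && c.filler.equals(p.filler)); // prints true
    }
}
```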

        All tests pass.

        Although I left in the ShingleAnalyzerWrapper and its test in the patch, no other Lucene filter (AFAICT) has such a filter wrapping facility. My vote is to remove these two files.

        Grant Ingersoll added a comment -

        Thanks, Steve. I will mark this as 2.4

        Steve Rowe added a comment -

        Removed the duplicate link (to LUCENE-759), since that issue is about character-level n-grams, and this issue is about word-level n-grams.

        Otis Gospodnetic added a comment -

        Thanks for bringing this up to date. I'll commit it after 2.3 is out.

        Grant Ingersoll added a comment -

        ping, Otis, do you still plan to commit?

        Steve Rowe added a comment -

        re-ping, Otis, do you still plan to commit?

        Otis Gospodnetic added a comment -

        Sorry for hogging. Got some local compilation issues with the query builder in contrib, so assigning to Grant to get this in.

        Grant Ingersoll added a comment -

        Committed revision 642612.

        Thanks Sebastian and Steve


          People

          • Assignee: Grant Ingersoll
          • Reporter: Sebastian Kirsch
          • Votes: 5
          • Watchers: 2
