
Key: LUCENE-400
Type: Improvement
Status: Closed
Resolution: Fixed
Priority: Minor
Assignee: Grant Ingersoll
Reporter: Sebastian Kirsch
Votes: 5
Watchers: 3
Lucene - Java

NGramFilter -- construct n-grams from a TokenStream

Created: 22/Jun/05 06:08 AM   Updated: 11/Oct/08 12:49 PM
Component/s: Analysis
Affects Version/s: unspecified
Fix Version/s: 2.4

Time Tracking:
Not Specified

File Attachments:
Type | File | Date | Author | Size
Text File | LUCENE-400.patch (licensed for inclusion in ASF works) | 2008-01-14 04:15 AM | Steven Rowe | 26 kB
Java Source File | NGramAnalyzerWrapper.java | 2005-06-22 06:10 AM | Sebastian Kirsch | 2 kB
Java Source File | NGramAnalyzerWrapperTest.java | 2005-07-29 09:56 PM | Sebastian Kirsch | 5 kB
Java Source File | NGramFilter.java | 2005-06-22 06:09 AM | Sebastian Kirsch | 6 kB
Java Source File | NGramFilterTest.java | 2005-06-22 06:12 AM | Sebastian Kirsch | 6 kB
Environment:
Operating System: All
Platform: All

Bugzilla Id: 35456
Lucene Fields: Patch Available
Resolution Date: 29/Mar/08 09:09 PM


Description:
This filter constructs n-grams (token combinations up to a fixed size, sometimes
called "shingles") from a token stream.
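
To illustrate what "token combinations up to a fixed size" means, here is a minimal in-memory sketch in plain Java. It is not the attached NGramFilter: the class ShingleSketch, the method shingles, and the minSize/maxSize parameters are hypothetical names, and whether single tokens are also emitted is left to the caller as an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not the attached NGramFilter: builds word n-grams in memory. */
class ShingleSketch {

    /**
     * Returns every run of minSize..maxSize consecutive tokens, joined by a space,
     * ordered by starting token and then by length.
     */
    static List<String> shingles(List<String> tokens, int minSize, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder sb = new StringBuilder();
            for (int len = 1; len <= maxSize && start + len <= tokens.size(); len++) {
                if (len > 1) {
                    sb.append(' ');
                }
                sb.append(tokens.get(start + len - 1));
                if (len >= minSize) {
                    out.add(sb.toString());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens =
            Arrays.asList("please", "divide", "this", "sentence", "into", "ngrams");
        // Two-word shingles only; pass minSize = 1 to keep the single tokens as well.
        System.out.println(shingles(tokens, 2, 2));
    }
}
```

With minSize and maxSize both set to 2, every pair of adjacent tokens is emitted, starting with "please divide" and ending with "into ngrams".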

The filter sets start offsets, end offsets and position increments, so
highlighting and phrase queries should work.
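
One plausible reading of the offset handling, sketched in plain Java: a shingle takes the start offset of its first token and the end offset of its last token, so a highlighter can cut the matching span out of the original text. This is an assumption about the behaviour, not code taken from NGramFilter, and OffsetSketch and shingleOffsets are hypothetical names.

```java
/** Hypothetical sketch of shingle offsets; not taken from the attached filter. */
class OffsetSketch {

    /**
     * Returns {startOffset, endOffset} for a shingle covering tokens first..last
     * (inclusive), given the per-token start and end offsets.
     */
    static int[] shingleOffsets(int[] starts, int[] ends, int first, int last) {
        // Assumption: the shingle spans from its first token's start to its last token's end.
        return new int[] { starts[first], ends[last] };
    }

    public static void main(String[] args) {
        String text = "please divide this";
        int[] starts = { 0, 7, 14 };
        int[] ends = { 6, 13, 18 };
        int[] span = shingleOffsets(starts, ends, 0, 1);
        // The span can be used to pull the highlighted text back out of the source.
        System.out.println(text.substring(span[0], span[1]));
    }
}
```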

Position increments > 1 in the input stream are replaced by filler tokens
(tokens with termText "_" and endOffset - startOffset = 0) in the output
n-grams. (Position increments > 1 in the input stream are usually caused by
removing some tokens, e.g. stopwords, from a stream.)
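
The gap handling described above can be sketched roughly as follows, in plain Java rather than against the Lucene API. The Tok class, fillGaps, and bigrams are hypothetical names for illustration, and placing the filler at the start offset of the following token is an assumption; the description only states that the filler is zero-width.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of gap filling; names are not from the attached code. */
class FillerSketch {

    /** Minimal stand-in for a Lucene token: text, offsets, and position increment. */
    static final class Tok {
        final String text;
        final int start;
        final int end;
        final int posInc;

        Tok(String text, int start, int end, int posInc) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.posInc = posInc;
        }
    }

    /** Inserts (posInc - 1) zero-width "_" tokens before any token whose posInc is > 1. */
    static List<Tok> fillGaps(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : in) {
            for (int i = 1; i < t.posInc; i++) {
                // Assumption: the filler sits at the start offset of the token after the gap.
                out.add(new Tok("_", t.start, t.start, 1));
            }
            out.add(new Tok(t.text, t.start, t.end, 1));
        }
        return out;
    }

    /** Joins each adjacent pair of tokens into a two-word shingle. */
    static List<String> bigrams(List<Tok> toks) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < toks.size(); i++) {
            out.add(toks.get(i).text + " " + toks.get(i + 1).text);
        }
        return out;
    }

    public static void main(String[] args) {
        // "divide the sentence" with the stopword "the" removed upstream.
        List<Tok> in = Arrays.asList(
            new Tok("divide", 0, 6, 1),
            new Tok("sentence", 11, 19, 2));
        System.out.println(bigrams(fillGaps(in)));
    }
}
```

Here the removed stopword leaves one gap, so one filler token is inserted and the shingles become "divide _" and "_ sentence" instead of the misleading "divide sentence".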

The filter uses CircularFifoBuffer and UnboundedFifoBuffer from Apache
Commons-Collections.

Filter, test case and an analyzer are attached.



Sebastian Kirsch added a comment - 22/Jun/05 06:09 AM
Created an attachment (id=15504)
NGramFilter

Sebastian Kirsch added a comment - 22/Jun/05 06:10 AM
Created an attachment (id=15505)
NGramAnalyzerWrapper (wraps an NGramFilter around an analyzer.)

Sebastian Kirsch added a comment - 22/Jun/05 06:12 AM
Created an attachment (id=15506)
JUnit TestCase for NGramFilter

Robert Newson added a comment - 29/Jun/05 10:26 AM
 * <p>For example, the sentence "please divide this sentence into ngrams" would be
 * tokenized into the tokens "please divide", "this sentence", "sentence into", and
 * "into ngrams".

The comment should read:

 * <p>For example, the sentence "please divide this sentence into ngrams" would be
 * tokenized into the tokens "please divide", "divide this", "this sentence",
 * "sentence into", and "into ngrams".

Sebastian Kirsch added a comment - 29/Jul/05 09:56 PM
Created an attachment (id=15818)
JUnit test class for NGramAnalyzerWrapper

The tests in this class are concerned with the interaction between QueryParser
and an NGramAnalyzer, and whether searching works as expected on an index
constructed with an NGramAnalyzer.

One of the test cases throws an exception that I haven't investigated yet. So
proceed with caution if you use the QueryParser with NGramAnalyzer.

..E....
Time: 1.771
There was 1 error:
1) testNGramAnalyzerWrapperPhraseQueryParsingFails(org.apache.lucene.analysis.NGramAnalyzerWrapperTest)java.lang.NullPointerException
    at org.apache.lucene.index.MultipleTermPositions.skipTo(MultipleTermPositions.java:178)
    at org.apache.lucene.search.PhrasePositions.skipTo(PhrasePositions.java:47)
    at org.apache.lucene.search.PhraseScorer.doNext(PhraseScorer.java:73)
    at org.apache.lucene.search.PhraseScorer.next(PhraseScorer.java:66)
    at org.apache.lucene.search.Scorer.score(Scorer.java:47)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:102)
    at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:65)
    at org.apache.lucene.search.Hits.<init>(Hits.java:44)
    at org.apache.lucene.search.Searcher.search(Searcher.java:40)
    at org.apache.lucene.search.Searcher.search(Searcher.java:32)
    at org.apache.lucene.analysis.NGramAnalyzerWrapperTest.queryParsingTest(NGramAnalyzerWrapperTest.java:75)
    at org.apache.lucene.analysis.NGramAnalyzerWrapperTest.testNGramAnalyzerWrapperPhraseQueryParsingFails(NGramAnalyzerWrapperTest.java:100)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at org.apache.lucene.analysis.NGramAnalyzerWrapperTest.main(NGramAnalyzerWrapperTest.java:36)

FAILURES!!!
Tests run: 6, Failures: 0, Errors: 1


Jeff Turner made changes - 03/Sep/05 03:29 PM
Field | Original Value | New Value
issue.field.bugzillaimportkey | 35456 | 12314550
Otis Gospodnetic added a comment - 09/Jul/06 09:51 PM
Sebastian, ever figured out the problem? Also, is there a way to get rid of the Commons Collections? Lucene has no run-time dependencies on other libraries.

Sebastian Kirsch added a comment - 07/Aug/06 09:14 PM
Hi Otis,

I did not figure out the problem. Getting rid of Commons Collections should be no problem; I am just using its buffers as FIFOs. However, I do not have the time at the moment to implement this.

Kind regards, Sebastian


Grant Ingersoll added a comment - 12/Jan/08 10:56 PM
Lucene has NGram support

Grant Ingersoll made changes - 12/Jan/08 10:56 PM
Resolution | | Won't Fix [ 2 ]
Status | Open [ 1 ] | Closed [ 6 ]
Assignee | Lucene Developers [ java-dev@lucene.apache.org ] |
Grant Ingersoll made changes - 12/Jan/08 11:13 PM
Link This issue duplicates LUCENE-759 [ LUCENE-759 ]
Steven Rowe added a comment - 13/Jan/08 05:36 AM
Lucene has character NGram support, but not word NGram support, which this filter supplies:

This filter constructs n-grams (token combinations up to a fixed size, sometimes called "shingles") from a token stream.


Grant Ingersoll added a comment - 13/Jan/08 01:33 PM
Good catch, Steve. I will reopen, as a word based ngram filter is useful.

Grant Ingersoll made changes - 13/Jan/08 01:33 PM
Status | Closed [ 6 ] | Reopened [ 4 ]
Resolution | Won't Fix [ 2 ] |
Steven Rowe added a comment - 14/Jan/08 04:15 AM
Repackaged these four files as a patch, with the following modifications to the code:
  • Renamed files and variables to refer to "n-grams" as "shingles", to avoid confusion with the character-level n-gram code already in Lucene's sandbox
  • Placed code in the o.a.l.analysis.shingle package
  • Converted commons-collections FIFO usages to LinkedLists
  • Removed @author from javadocs
  • Changed deprecated Lucene API usages to alternate forms; addressed all compilation warnings
  • Changed code style to conform to Lucene conventions
  • Changed field setters to return void instead of a reference to the class instance, then changed instantiations to use individual setter calls instead of the chained calling style
  • Added ASF license to each file
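
For readers wondering how a LinkedList can stand in for the commons-collections FIFO buffers, here is a minimal sketch of a sliding token window. It does not reproduce the code in the patch; FifoShingleSketch and its shingles method are hypothetical names, and emitting only full-size windows is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

/** Hypothetical streaming sketch; it does not reproduce the patch's ShingleFilter. */
class FifoShingleSketch {

    /** Emits one shingle of exactly `size` tokens each time the window is full. */
    static List<String> shingles(Iterator<String> tokens, int size) {
        LinkedList<String> window = new LinkedList<>(); // used purely as a FIFO
        List<String> out = new ArrayList<>();
        while (tokens.hasNext()) {
            window.addLast(tokens.next());
            if (window.size() > size) {
                window.removeFirst(); // drop the oldest token, as a circular buffer would
            }
            if (window.size() == size) {
                out.add(String.join(" ", window));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Iterator<String> it = Arrays.asList("a", "b", "c", "d").iterator();
        // The window slides forward one token at a time.
        System.out.println(shingles(it, 3));
    }
}
```

Because the window only ever adds at the tail and removes at the head, the standard library's list is enough and no extra run-time dependency is needed.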

All tests pass.

Although I left in the ShingleAnalyzerWrapper and its test in the patch, no other Lucene filter (AFAICT) has such a filter wrapping facility. My vote is to remove these two files.


Steven Rowe made changes - 14/Jan/08 04:15 AM
Attachment LUCENE-400.patch [ 12373074 ]
Grant Ingersoll added a comment - 14/Jan/08 12:29 PM
Thanks, Steve. I will mark this as 2.4

Grant Ingersoll made changes - 14/Jan/08 12:29 PM
Lucene Fields [Patch Available]
Fix Version/s 2.4 [ 12312681 ]
Steven Rowe made changes - 14/Jan/08 06:38 PM
Link This issue duplicates LUCENE-759 [ LUCENE-759 ]
Steven Rowe added a comment - 14/Jan/08 06:40 PM
Removed the duplicate link (to LUCENE-759), since that issue is about character-level n-grams, and this issue is about word-level n-grams.

Otis Gospodnetic added a comment - 14/Jan/08 06:55 PM
Thanks for bringing this up to date. I'll commit it after 2.3 is out.

Otis Gospodnetic made changes - 14/Jan/08 06:55 PM
Assignee Otis Gospodnetic [ otis ]
Grant Ingersoll added a comment - 03/Mar/08 01:28 PM
ping, Otis, do you still plan to commit?

Steven Rowe added a comment - 18/Mar/08 05:20 PM
re-ping, Otis, do you still plan to commit?

Otis Gospodnetic added a comment - 25/Mar/08 10:39 PM
Sorry for hogging. Got some local compilation issues with the query builder in contrib, so assigning to Grant to get this in.

Otis Gospodnetic made changes - 25/Mar/08 10:39 PM
Assignee | Otis Gospodnetic [ otis ] | Grant Ingersoll [ gsingers ]
Grant Ingersoll added a comment - 29/Mar/08 09:09 PM
Committed revision 642612.

Thanks Sebastian and Steve


Grant Ingersoll made changes - 29/Mar/08 09:09 PM
Resolution | | Fixed [ 1 ]
Status | Reopened [ 4 ] | Resolved [ 5 ]
Repository: ASF
Revision: #642612
Date: Sat Mar 29 21:11:33 UTC 2008
User: gsingers
Message: LUCENE-400: Added ShingleFilter (token based ngram)
Files Changed:
ADD /lucene/java/trunk/contrib/analyzers/src/test/org/apache/lucene/analysis/shingle/ShingleAnalyzerWrapperTest.java
ADD /lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/shingle/ShingleAnalyzerWrapper.java
ADD /lucene/java/trunk/contrib/analyzers/src/test/org/apache/lucene/analysis/shingle/ShingleFilterTest.java
ADD /lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/shingle/ShingleFilter.java
ADD /lucene/java/trunk/contrib/analyzers/src/test/org/apache/lucene/analysis/shingle
ADD /lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/shingle
MODIFY /lucene/java/trunk/CHANGES.txt

Michael McCandless made changes - 11/Oct/08 12:49 PM
Status | Resolved [ 5 ] | Closed [ 6 ]