Issue Details (XML | Word | Printable)

Key: LUCENE-759
Type: Improvement Improvement
Status: Resolved Resolved
Resolution: Fixed
Priority: Minor Minor
Assignee: Otis Gospodnetic
Reporter: Otis Gospodnetic
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Lucene - Java

Add n-gram tokenizers to contrib/analyzers

Created: 22/Dec/06 11:22 PM   Updated: 14/Jan/08 06:39 PM
Return to search
Component/s: Analysis
Affects Version/s: None
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works LUCENE-759-filters.patch 2007-03-02 06:17 PM Otis Gospodnetic 19 kB
Text File Licensed for inclusion in ASF works LUCENE-759.patch 2007-02-21 02:23 PM Otis Gospodnetic 11 kB
Text File Licensed for inclusion in ASF works LUCENE-759.patch 2007-02-16 08:04 PM Otis Gospodnetic 12 kB
Text File Licensed for inclusion in ASF works LUCENE-759.patch 2006-12-22 11:43 PM Otis Gospodnetic 15 kB
Issue Links:
Reference
 

Lucene Fields: Patch Available, New
Resolution Date: 13/Jul/07 01:54 PM


 Description  « Hide
It would be nice to have some n-gram-capable tokenizers in contrib/analyzers. Patch coming shortly.

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Otis Gospodnetic added a comment - 22/Dec/06 11:43 PM
Included:
NGramTokenizer
NGramTokenizerTest
EdgeNGramTokenizer
EdgeNGramTokenizerTest

Otis Gospodnetic added a comment - 22/Dec/06 11:44 PM
Unit tests pass, committed.

Otis Gospodnetic added a comment - 16/Feb/07 08:01 PM
Reopening, because I'm bringing in Adam Hiatt's modifications that he uploaded in a patch for SOLR-81. Adam's changes allow this tokenizer to create n-grams whose sizes are specified as a min-max range.

This patch fixes a bug in Adam's code, but has another bug that I don't know how to fix now.
Adam's bug:
input: abcde
minGram: 1
maxGram: 3
output: a ab abc – and this is where tokenizing stopped, which was wrong, it should have continued: b bc bcd c cd cde d de e

Otis' bug:
input: abcde
minGeam: 1
maxGram: 3
output: e de cde d cd bcd c bc abc b ab – and this is where tokenizing stops, which is wrong, it should generate one more n-gram: a

This bug won't hurt SOLR-81, but it should be fixed.


Otis Gospodnetic added a comment - 16/Feb/07 08:04 PM
The modified tokenizer and the extended unit test.

Adam Hiatt added a comment - 16/Feb/07 11:10 PM
Otis: this really isn't a bug. The min/max gram code I added only applied to the EdgeNGramTokenizer. I only want to generate edge n-grams between the range of sizes provided.

For example, with the EdgeNGramTokenizer
input: abcde
minGram: 1
maxGram: 3

'a ab abc' is in fact what I intended to produce.

I think it makes more sense for the functionality to which you referred to be located in NGramTokenizer.


Otis Gospodnetic added a comment - 17/Feb/07 08:10 AM
Damn, I think you are right! Once again, I'm making late night mistakes. When will I learn!?
But I could take my code to NGramTokenizer then, at least.
My bug remains, though..... got an idea for a fix?

Otis Gospodnetic added a comment - 21/Feb/07 02:23 PM
Here is the proper version. This one is essentially the Lucene-n-gram-analyzer-specific Adam's patch from SOLR-81 + some passing unit tests I wrote to exercise the new n-gram range functionality.

I'll commit this by the end of the week unless Adam spots a bug.


Otis Gospodnetic added a comment - 01/Mar/07 02:23 PM
In SVN.

Doron Cohen added a comment - 01/Mar/07 10:49 PM
I have two comments/questions on the n-gram tokenizers:

(1) Seems that only the first 1024 characters of the input are handled, and the rest is ignored (and I think as result the input stream would remain dangling open).

If you add this test case:

/**

  • Test that no ngrams are lost, even for really long inputs
  • @throws EXception
    */
    public void testLongerInput() throws Exception {
    int expectedNumTokens = 1024;
    int ngramLength = 2;
    // prepare long string
    StringBuffer sb = new StringBuffer();
    while (sb.length()<expectedNumTokens+ngramLength-1)
    sb.append('a');

StringReader longStringReader = new StringReader (sb.toString());
NGramTokenizer tokenizer = new NGramTokenizer(longStringReader, ngramLength, ngramLength);
int numTokens = 0;
Token token;
while ((token = tokenizer.next())!=null) { numTokens++; assertEquals("aa",token.termText()); }
assertEquals("wrong number of tokens",expectedNumTokens,numTokens);
}

With expectedNumTokens = 1023 it would pass, but any larger number would fail.

(2) It seems safer to read the characters like this
int n = input.read(chars);
inStr = new String(chars, 0, n);
(This way not counting on String.trim(), which does work, but worries me).


Otis Gospodnetic added a comment - 02/Mar/07 06:13 PM
More goodies coming.

Otis Gospodnetic added a comment - 02/Mar/07 06:17 PM
N-gram-producting TokenFilters for Karl's mom.

Doron Cohen added a comment - 03/Mar/07 04:01 AM
Hi Otis,

> (and I think as result the input stream would remain dangling open)

I take this part back - closing tokenStream would close the reader, and at least for the case that I thought of, invertDocument, the tokenStream is properly closed.

Can you comment on the input length: is it correct to handle only the first 1024 characters?

Thanks,
Doron


Otis Gospodnetic added a comment - 03/Mar/07 04:44 PM
Ah, didn't see your comments here earlier, Doron. Yes, I think you are correct about the 1024 limit - when I wrote that Tokenizer I was thinking TokenFilter, and thus I was thinking that that input Reader represents a Token, which was wrong. So, I thought, "oh, 1024 chars/token, that will be enough". I ended up needing TokenFilters for SOLR-81, so that's what I checked in. Those operate on tokens and don't have the 1024 limitation.

Anyhow, feel free to slap your test + the fix in and thanks for checking!


Patrick Turcotte added a comment - 08/Mar/07 04:35 AM
Is it just me or are the NGramTokenFilter and EdgeNGramTokenFilter class not committed to SVN and not in the patch either?

NGramTokenFilterTest and EdgeNGramTokenFilterTest are referring to them, but I can not seem to find them.

Thanks and keep the good work.

Patrick


Hoss Man added a comment - 08/Mar/07 06:14 PM
Otis's most recent attachment contains only tests .. but previous attachemnts had implementations.

all of which have been commited under contrib/analyzers

(tip: if you click "All" at the top of the list of comments in Jira, you see every modification related to this issue, including subversion commits that Jira detects are related to the issue based on the commit message)


Hoss Man added a comment - 08/Mar/07 09:43 PM
Ack! ... i'm sorry i completely missread Patrick's question.

ngram Tokenizers have been commited – but there are no ngram TokenFilters ... there are tests for TokenFilters Otis commited on March2, but those tests do't currently compile/run without the TokenFilter's themselves.

Otis: do you have these TokenFilter's in your working directory that you perhaps forgot to svn add before committing?


Otis Gospodnetic added a comment - 13/Mar/07 12:31 AM
Oh, look at that!

[otis@localhost contrib]$ svn st
A analyzers/src/java/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.java
A analyzers/src/java/org/apache/lucene/analysis/ngram/NGramTokenFilter.java


It's in the repo now. Sorry about that!


Daniel Naber added a comment - 01/Jun/07 06:53 PM
Can this issue be closed or is there anything still open?

Otis Gospodnetic added a comment - 13/Jul/07 01:54 PM
This should have been marked Fixed a while back.