|
Unit tests pass, committed.
Reopening, because I'm bringing in Adam Hiatt's modifications that he uploaded in a patch for
This patch fixes a bug in Adam's code, but has another bug that I don't know how to fix now. Otis' bug: This bug won't hurt The modified tokenizer and the extended unit test.
Otis: this really isn't a bug. The min/max gram code I added only applied to the EdgeNGramTokenizer. I only want to generate edge n-grams between the range of sizes provided.
For example, with the EdgeNGramTokenizer 'a ab abc' is in fact what I intended to produce. I think it makes more sense for the functionality to which you referred to be located in NGramTokenizer. Damn, I think you are right!
But I could take my code to NGramTokenizer then, at least. My bug remains, though..... got an idea for a fix? Here is the proper version. This one is essentially the Lucene-n-gram-analyzer-specific Adam's patch from
I'll commit this by the end of the week unless Adam spots a bug. I have two comments/questions on the n-gram tokenizers:
(1) Seems that only the first 1024 characters of the input are handled, and the rest is ignored (and I think as result the input stream would remain dangling open). If you add this test case: /**
StringReader longStringReader = new StringReader (sb.toString()); With expectedNumTokens = 1023 it would pass, but any larger number would fail. (2) It seems safer to read the characters like this N-gram-producting TokenFilters for Karl's mom.
Hi Otis,
> (and I think as result the input stream would remain dangling open) I take this part back - closing tokenStream would close the reader, and at least for the case that I thought of, invertDocument, the tokenStream is properly closed. Can you comment on the input length: is it correct to handle only the first 1024 characters? Thanks, Ah, didn't see your comments here earlier, Doron. Yes, I think you are correct about the 1024 limit - when I wrote that Tokenizer I was thinking TokenFilter, and thus I was thinking that that input Reader represents a Token, which was wrong. So, I thought, "oh, 1024 chars/token, that will be enough". I ended up needing TokenFilters for
Anyhow, feel free to slap your test + the fix in and thanks for checking! Is it just me or are the NGramTokenFilter and EdgeNGramTokenFilter class not committed to SVN and not in the patch either?
NGramTokenFilterTest and EdgeNGramTokenFilterTest are referring to them, but I can not seem to find them. Thanks and keep the good work. Patrick Otis's most recent attachment contains only tests .. but previous attachemnts had implementations.
all of which have been commited under contrib/analyzers (tip: if you click "All" at the top of the list of comments in Jira, you see every modification related to this issue, including subversion commits that Jira detects are related to the issue based on the commit message) Ack! ... i'm sorry i completely missread Patrick's question.
ngram Tokenizers have been commited – but there are no ngram TokenFilters ... there are tests for TokenFilters Otis commited on March2, but those tests do't currently compile/run without the TokenFilter's themselves. Otis: do you have these TokenFilter's in your working directory that you perhaps forgot to svn add before committing? Oh, look at that!
[otis@localhost contrib]$ svn st
Can this issue be closed or is there anything still open?
This should have been marked Fixed a while back.
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
NGramTokenizer
NGramTokenizerTest
EdgeNGramTokenizer
EdgeNGramTokenizerTest