|
Stanislaw Osinski made changes - 26/Jul/07 01:12 PM
Here is a very simple benchmark I used to test the performance of StandardAnalyzer, FastAnalyzer and WhitespaceAnalyzer. I ran it on a number of JVMs and got the following results:
Input: Reuters collection, the one used by contrib/benchmark, only documents Machine: AMD Sempron 2600+, 2G RAM, Windows XP Sun 1.4.2 Server Sun 1.4.2 Client Sun 1.5.0 Server Sun 1.5.0 Client Sun 1.6.0 Server Sun 1.6.0 Client IBM 1.4.2 IBM 1.5.0 BEA 1.4.2 BEA 1.5.0 (looks really slow)
Stanislaw Osinski made changes - 26/Jul/07 01:15 PM
Thanks Staszek, very nice!
If FastTokenizer is compatible with StandardTokenizer, it should probably just replace it. Have you tried this replacement and running all the lucene tests? Great patch! And a very quick turnaround to boot!
I am seeing a HUGE speed increase! I also think this should certainly become the new StandardAnalyzer. I am running into only one issue...when rebuilding, after a clean, FastTokenizer does not seem to get generated – FastTokenizerImpl is there, but no FastTokenizer interface; it appears to be wiped out in the clean but not regenerated in the build. If I put it in manually, everything works great.
We did pretty much the same thing here at Aconex, The tokenization mechanism in the old javacc-based analyser is woeful compared to what JFlex outputs.
Nice work! Here's another patch (against r560135) which:
Additionally, I noticed that ChainedFilterTest from contrib/miscellaneous would fail on machines with non-US locale, so this patch fixes that too.
Stanislaw Osinski made changes - 27/Jul/07 08:05 AM
Here is another (this time let's call it "final"
Stanislaw Osinski made changes - 31/Jul/07 10:05 AM
I took the patch from here (to use jflex for StandardAnalyzer) and
merged it with the patch from to measure the net performance gains. I measure the time to just tokenize all of Wikipedia using With the jflex patch it takes 646 sec (best of 2 runs); when I then A couple other things I noticed:
I tracked down at least some differences between the JavaCC vs JFlex
versions of StandardAnalyzer. I think we should resolve these before committing. I just printed all tokens for the first 20 Wikipedia docs and diff'd Here are the categories of differences that I saw:
OLD: (2004.jpg,34461,34469,type=<HOST>) In this case the old StandardAnalyzer called "2004.jpg" a HOST and
OLD: (62.46,37004,37009,type=<HOST>) The new tokenizer looks right here. I guess the decimal point
OLD: (978-0-94045043-1,86408,86424,type=<NUM>) The new one split off the final "-1" as its own token, and called
OLD: (78academyawards/rules/rule02.html,7194,7227,type=<NUM>) I think the old one is better, though it should not be called a
OLD: (2006-03-11t082958z_01_ban130523_rtridst_0_ozabs,2076,2123,type=<NUM>) Thanks for spotting the differences, I'll add them to the unit tests and will correct the tokenizer accordingly.
One doubt I have is about the filename-like tokens, e.g.: OLD: (2004.jpg,34461,34469,type=<HOST>) To be honest, both variants seem "almost" correct. If you try 2007.org – this is a correct domain name (and has a funny website on it For the JFlex-based tokenizer, I put the <NUM> rule matching first, as it gives some nice performance benefits. We can put HOST first, and then we'll get compliance with the old version. Another option we might consider is:
I'm not sure how this will affect the performance of the tokenizer, but my rough guess is that if we don't come up with very complex/ backtracking-prone rules there should not be too much of a difference. On the other hand, if 100% compatibility with the old tokenizer is a priority, adding new token types is not a good idea, I guess. Finally, when it comes to the initialization time of the new tokenizer – according to the JFlex documentation, some time is required to unpack the transition tables. But the unpacking takes place during the initialization of static fields, so once the class is loaded the overhead should be negligible. I think it is important that we do not modify the StandardAnalyzer and that this improved version has results identical to the original. I believe that the StandardAnalyzer is pretty heavily used – these performance increases could give many users huge gains. By changing this Analyzer at all we will be forcing users to effectively lose terms in their indexes if they want to upgrade core Lucene for use on an older index.
I think any enhanced StandardAnalzyer should probably be released as a new Analyzer. This one should mimic the old though, as the performance gains will be very beneficial to a lot of current users.
I agree, let's try to perfectly match the tokens of the old The speedups of JFlex are amazing: based on a quick test, with JFlex + > Finally, when it comes to the initialization time of the new Yeah I'm baffled why it's that much slower, but on 100 token docs I When I ran under the profiler it was the StandardTokenizerImpl In any event I think it's OK, so long as we get It is important that the same sequence of token text is produced, but I think we could live with different token types in some cases, if we must. Few applications depend on token types, no?
Provided the token text issues can be resolved, I'd like to see StandardTokenizer replaced with this. Performance is important, and ideally folks shouldn't have to change their applications to see performance improvements. When digging deeper into the issues of compatibility with the original StandardAnalyzer, I stumbled upon something strange. Take the following text:
78academyawards/rules/rule02.html,7194,7227,type which was tokenized by the original StandardAnalyzer as one <NUM>. If you look at the definition of the <NUM> token: // every other segment must have at least one digit
you'll see that, as explained in the comment, every other segment must have at least one digit. But actually, according to my understanding, this rule should not match the above text as a whole (and with JFlex it doesn't , actually). Below is the text split by punctuation characters, and it looks like there is no way of splitting this text into alternating segments, every second of which must have a digit (A = ALPHANUM, H = HAS_DIGIT): 78academyawards / rules / rule02 . html , 7194 , 7227 , type
I have no idea why JavaCC matched the whole text as a <NUM>, JFlex behaved "more correctly" here. Now I can see two solutions:
// there must be at least one segment with a digit With this definition, again, all StandardAnalyzer tests pass, plus all texts along the lines of: 2006-03-11t082958z_01_ban130523_rtridst_0_ozabs,2076,2123,type get parsed as a whole as one <NUM>, which is equivalent to what JavaCC-based version would do. I will attach a corresponding patch in a second. A patch for better compatibility with the StandardAnalyzer containing:
I noticed that with this patch org.apache.lucene.benchmark.quality.TestQualityRun.testTrecQuality fails, but I'm not sure if this is related to the tokenizer.
Stanislaw Osinski made changes - 01/Aug/07 08:33 AM
Oddly, the patch for TestStandardAnalyzer failed to apply for me (but
the rest did), so I manually merged those changes in. Oh, I see: it was the "Korean words" test – somehow the characters got mapped to ?'s in your patch. This is why the patch didn't apply, I think? Maybe you used a diffing tool that wasn't happy with unicode or something? I also see the quality test failing in contrib benchmark. I fear KO I re-tested the old vs new StandardAnalyzer on Wikipedia and I OLD NEW I like the NEW behavior better but I fear we should try to match the Here's another one: OLD NEW I like the old behavior better here. Another one: OLD NEW Another one: OLD NEW The search quality test failure can be caused by the standard
analyzer generating different tokens than before. (has nothing to do with token types.) This is because the test's topics (queries) and qrels (expected matches) It is not difficult to update this test for a modified analyzer, but it seems Thanks for more test cases. I guess the biggest problem here is that the scanner generated by JavaCC doesn't seem to strictly follow the specification (see https://issues.apache.org/jira/browse/LUCENE-966#action_12516893
As a side note, it's a shame there's no trace of the version of JavaCC that was used to generate the scanner for the original StandardAnalyzer. I'm also curious if the results of the current JavaCC grammar would be the same with the newest version of the generator (4.0 I guess) – I'll try to check that. Anyway, I'll take a look at the problem in more depth once again. And in the worst case scenario, we can keep the StandardAnalyzer as it was and add the new one next to it so that people can have a choice (on the other hand, this might be a problem for the quality tests). If it really is down to emulating the bugs/oddities in JavaCC then I
think it's not worth polluting the new tokenizer with these legacy bugs, unless one or two cases can match perfectly and not degrade performance too badly? And maybe what we should do is make this a new tokenizer, calling it This way people relying on the precise bugs in JavaCC tokenization are To be precise – I'm not 100% sure that this is a bug in JavaCC (I'll try to browse/ask on their mailing list to find out), but it looks like the scanner generated by JavaCC does not really strictly follow the grammar. I discovered this when you gave the examples of different token produced by JFlex- and JavaCC-based scanners generated from equivalent grammars – JFlex seemed to behave "correctly", hence different tokens.
I was about to start trying to hack the grammar to produce results simiar to StandardAnalyzer (based on a finite set of test cases +1 for deprecating StandardAnalyzer, although it is a little weird to have a StandardAnalyzer2, which is it the standard or not? Maybe we could call it the DefaultAnalyzer or something like that. I never much cared for tacking on numbers on the end of class names.
I agree with Mike M's last comment and approach, though, if that proves to be the case for the differences. I like the name DefaultAnalyzer.
These issues seem odd.
Both JavaCC and Flex match with the same rules: OLD NEW The old is correct and the NEW should not match <NUM>. <NUM> should break on '/' and '.' and every other token from the break should have a digit for a NUM match to occur. This is not the case. OLD NEW Something is wrong with the NEW one. <NUM> is certainly a valid longer match. OLD NEW Again, something seems wrong with the NEW. (safari-0-sheikh,12011,12026,type=<NUM>) is a correct and longer match than (safari,12011,12017,type=<ALPHANUM>) It would be nice to have the source text for these comparisons. Also, a hard vote against StandardAnalyzer2 <g> Default is arguable as well, as this wouldn't be the default analyzer you should use in many cases (don't like standard because of that either). From the latest samples, I would say something is off with the NEW and OLD appears mostly correct.
By the way...you can see one of the issues here:
OLD NEW JavaCC StandardAnalyzer would never output a token that starts with a '/'. It would be cut off. The issues may be involved with how JFlex skips characters that are not part of a match compared to how JavaCC is doing it. Or perhaps the JFlex version is considering '/' and '.' to be ALPHANUM's.
>When digging deeper into the issues of compatibility with the original StandardAnalyzer, I stumbled upon something strange. Take the following >text:
> >78academyawards/rules/rule02.html,7194,7227,type > >which was tokenized by the original StandardAnalyzer as one <NUM>. If you look at the definition of the <NUM> token: > >// every other segment must have at least one digit ><NUM: (<ALPHANUM> <P> <HAS_DIGIT> > | <HAS_DIGIT> <P> <ALPHANUM> > | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+ > | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+ > | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+ > | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+ > ) > >you'll see that, as explained in the comment, every other segment must have at least one digit. But actually, according to my understanding, this >rule should not match the above text as a whole (and with JFlex it doesn't , actually). Below is the text split by punctuation characters, and it looks >like there is no way of splitting this text into alternating segments, every second of which must have a digit (A = ALPHANUM, H = HAS_DIGIT): > >78academyawards / rules / rule02 . html , 7194 , 7227 , type > H P A P H P A P H P A P H?* (starting from the beginning) > H?* P A P H P A (starting from the end) > >* (would have to be H, but no digits in substring "type" or "html") > >I have no idea why JavaCC matched the whole text as a <NUM>, JFlex behaved "more correctly" here. I think that JavaCC is correct here. Every other segment must have a digit: 78academyawards / rules / rule02 . html Okkkk – only now I realized I made a really silly mistake
One more important clarification – the tokenizer from the last patch (jflex-analyzer-r561693-compatibility.txt) has a completely different definition of the <NUM> token – it allows digits in any segment, hence the totally different results. If we want to be compatible with the StandardAnalyzer, we should forget about that patch. Mark – have you tried the jflex-analyzer-r560135-patch.txt patch with your wikipedia diff test? That's the early one whose grammar was "dot for dot" translated from the original JavaCC spec – for further patches I did some "optimizations", which seem to have broken the compatibility... Incidentally, what was the motivation for requiring the <NUM> token to have numbers only in every second segment and not in any segment? I've attached another patch with:
I hope this time the patch correctly applies to the test class (previously I used Eclipse to create the patch which may not have been a good idea). All unit tests pass, so I hope this time I finally got it right.
Stanislaw Osinski made changes - 03/Aug/07 08:19 AM
The TestStandardAnalyzer.java patch still fails to apply – looks like
the same thing as before (the unicode chars under "Korean words" were replaced by ?'s somewhere along the way). And, somehow, many of the files are dup'd (appear twice) in the > The TestStandardAnalyzer.java patch still fails to apply – looks like
> the same thing as before (the unicode chars under "Korean words" were > replaced by ?'s somewhere along the way). WOOPS! This was my bad. I was clicking on the patch and then Sorry for the confusion. This time I used Tortoise, but it made things worse
One more try – this time without duplicated entries (no idea why Tortoise did this...).
Stanislaw Osinski made changes - 03/Aug/07 12:37 PM
Super, I will compare tokens on Wikipedia...
First 1000 documents in wikipedia show absolutely no difference – I think this one is it!
Good news – thanks for the test!
OK I can commit this. Does anyone see any reason not to? I will wait a few days then commit...
Thank you Stanislaw! I was gearing up to commit this, but realized the copyright on top of /*
I believe (according to http://www.apache.org/legal/src-headers.html I would also add the Apache header into StandardAnalyzerImpl.jflex. Absolutely – the header was there only because I used Carrot2's grammar to start with and forgot to remove it. You can put the Apache header on all the files, NOTICE.txt is not necessary either. Apologies for confusion
OK I committed this! Thank you Stanislaw!
I ran a quick perf test on Wikipedia (first 50K docs only) and found I made these small additional changes over the final patch before
Michael McCandless made changes - 08/Aug/07 10:29 PM
> * I removed StandardAnalyzer.html "grammar doc" generation from
> build.xml since it was using jjdoc. Stanislaw, is there something > in jflex that can generated a BNF description of the grammar as > HTML? I've had a look at the JFlex docs and there doesn't seem to be such a tool for JFlex, I'm afraid. > > * I removed StandardAnalyzer.html "grammar doc" generation from
> > build.xml since it was using jjdoc. Stanislaw, is there something > > > in jflex that can generated a BNF description of the grammar as > > > HTML? > I've had a look at the JFlex docs and there doesn't seem to be such OK, I think we can just live without this. Thanks!
Michael Busch made changes - 25/Jan/08 03:24 AM
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Because FastAnalyzer shares some code with StandardAnalyzer, it might be worthwhile to do some refactorings: