Issue Details (XML | Word | Printable)

Key: LUCENE-966
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Unassigned
Reporter: Stanislaw Osinski
Votes: 2
Watchers: 2
Operations

If you were logged in you would be able to see more operations.
Lucene - Java

A faster JFlex-based replacement for StandardAnalyzer

Created: 26/Jul/07 01:04 PM   Updated: 25/Jan/08 03:24 AM
Return to search
Component/s: Analysis
Affects Version/s: None
Fix Version/s: 2.3

Time Tracking:
Not Specified

File Attachments:
  Size
Java Source File AnalyzerBenchmark.java 2007-07-26 01:15 PM Stanislaw Osinski 4 kB
Text File Licensed for inclusion in ASF works jflex-analyzer-patch.txt 2007-07-26 01:12 PM Stanislaw Osinski 52 kB
Text File Licensed for inclusion in ASF works jflex-analyzer-r560135-patch.txt 2007-07-27 08:05 AM Stanislaw Osinski 67 kB
Text File Licensed for inclusion in ASF works jflex-analyzer-r561292-patch.txt 2007-07-31 10:05 AM Stanislaw Osinski 143 kB
Text File Licensed for inclusion in ASF works jflex-analyzer-r561693-compatibility.txt 2007-08-01 08:33 AM Stanislaw Osinski 132 kB
Text File Licensed for inclusion in ASF works jflex-analyzer-r562378-patch-nodup.txt 2007-08-03 12:37 PM Stanislaw Osinski 133 kB
Text File Licensed for inclusion in ASF works jflex-analyzer-r562378-patch.txt 2007-08-03 08:19 AM Stanislaw Osinski 236 kB

Lucene Fields: New, Patch Available
Resolution Date: 08/Aug/07 10:29 PM


 Description  « Hide
JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Stanislaw Osinski added a comment - 26/Jul/07 01:12 PM
Here comes a somewhat rough (needing refactorings, see below) patch adding a JFlex-based repalcement for StandardAnalyzer called FastAnalyzer. A few comments:
  • all tests from TestStandardAnalyzer pass
  • tokenizer generation added to Ant build scripts

Because FastAnalyzer shares some code with StandardAnalyzer, it might be worthwhile to do some refactorings:

  • have a common test for both analyzers (only if we want to keep them in sync)
  • have a common superclass for FastFilter and StandardFilter
  • have a common superclass for FastAnalyzer and StandardAnalyzer

Stanislaw Osinski made changes - 26/Jul/07 01:12 PM
Field Original Value New Value
Attachment jflex-analyzer-patch.txt [ 12362606 ]
Stanislaw Osinski added a comment - 26/Jul/07 01:15 PM
Here is a very simple benchmark I used to test the performance of StandardAnalyzer, FastAnalyzer and WhitespaceAnalyzer. I ran it on a number of JVMs and got the following results:

Input: Reuters collection, the one used by contrib/benchmark, only documents
longer than 100 bytes

Machine: AMD Sempron 2600+, 2G RAM, Windows XP

Sun 1.4.2 Server
org.apache.lucene.analysis.standard.StandardAnalyzer: 15172 ms, 139667 tokens/s
org.apache.lucene.analysis.fast.FastAnalyzer: 2438 ms, 869170 tokens/s
org.apache.lucene.analysis.WhitespaceAnalyzer: 781 ms, 3547585 tokens/s

Sun 1.4.2 Client
org.apache.lucene.analysis.standard.StandardAnalyzer: 24187 ms, 87610 tokens/s
org.apache.lucene.analysis.fast.FastAnalyzer: 3157 ms, 671218 tokens/s
org.apache.lucene.analysis.WhitespaceAnalyzer: 1453 ms, 1906857 tokens/s

Sun 1.5.0 Server
org.apache.lucene.analysis.standard.StandardAnalyzer: 16062 ms, 131928 tokens/s
org.apache.lucene.analysis.fast.FastAnalyzer: 2641 ms, 802361 tokens/s
org.apache.lucene.analysis.WhitespaceAnalyzer: 750 ms, 3694218 tokens/s

Sun 1.5.0 Client
org.apache.lucene.analysis.standard.StandardAnalyzer: 23891 ms, 88696 tokens/s
org.apache.lucene.analysis.fast.FastAnalyzer: 3641 ms, 581993 tokens/s
org.apache.lucene.analysis.WhitespaceAnalyzer: 1437 ms, 1928089 tokens/s

Sun 1.6.0 Server
org.apache.lucene.analysis.standard.StandardAnalyzer: 13719 ms, 154460 tokens/s
org.apache.lucene.analysis.fast.FastAnalyzer: 2484 ms, 853074 tokens/s
org.apache.lucene.analysis.WhitespaceAnalyzer: 750 ms, 3694218 tokens/s

Sun 1.6.0 Client
org.apache.lucene.analysis.standard.StandardAnalyzer: 22312 ms, 94972 tokens/s
org.apache.lucene.analysis.fast.FastAnalyzer: 2750 ms, 770558 tokens/s
org.apache.lucene.analysis.WhitespaceAnalyzer: 1297 ms, 2136209 tokens/s

IBM 1.4.2
org.apache.lucene.analysis.standard.StandardAnalyzer: 11922 ms, 177741 tokens/s
org.apache.lucene.analysis.fast.FastAnalyzer: 3218 ms, 658495 tokens/s
org.apache.lucene.analysis.WhitespaceAnalyzer: 1407 ms, 1969199 tokens/s

IBM 1.5.0
org.apache.lucene.analysis.standard.StandardAnalyzer: 11797 ms, 179625 tokens/s
org.apache.lucene.analysis.fast.FastAnalyzer: 2968 ms, 713961 tokens/s
org.apache.lucene.analysis.WhitespaceAnalyzer: 1000 ms, 2770664 tokens/s

BEA 1.4.2
org.apache.lucene.analysis.standard.StandardAnalyzer: 16234 ms, 130530 tokens/s
org.apache.lucene.analysis.fast.FastAnalyzer: 3344 ms, 633683 tokens/s
org.apache.lucene.analysis.WhitespaceAnalyzer: 1343 ms, 2063040 tokens/s

BEA 1.5.0 (looks really slow)
org.apache.lucene.analysis.standard.StandardAnalyzer: 33891 ms, 62525 tokens/s
org.apache.lucene.analysis.fast.FastAnalyzer: 12703 ms, 166813 tokens/s
org.apache.lucene.analysis.WhitespaceAnalyzer: 4860 ms, 570095 tokens/s


Stanislaw Osinski made changes - 26/Jul/07 01:15 PM
Attachment AnalyzerBenchmark.java [ 12362607 ]
Yonik Seeley added a comment - 26/Jul/07 03:05 PM
Thanks Staszek, very nice!
If FastTokenizer is compatible with StandardTokenizer, it should probably just replace it.
Have you tried this replacement and running all the lucene tests?

Mark Miller added a comment - 26/Jul/07 04:35 PM
Great patch! And a very quick turnaround to boot!

I am seeing a HUGE speed increase!

I also think this should certainly become the new StandardAnalyzer.

I am running into only one issue...when rebuilding, after a clean, FastTokenizer does not seem to get generated – FastTokenizerImpl is there, but no FastTokenizer interface; it appears to be wiped out in the clean but not regenerated in the build. If I put it in manually, everything works great.

  • Mark

Paul Smith added a comment - 26/Jul/07 09:49 PM
We did pretty much the same thing here at Aconex, The tokenization mechanism in the old javacc-based analyser is woeful compared to what JFlex outputs.

Nice work!


Stanislaw Osinski added a comment - 27/Jul/07 08:05 AM
Here's another patch (against r560135) which:
  • replaces JavaCC-based StandardTokenizer with a JFlex-based one (all tests pass), as suggested by Yonik
  • updates build scripts accordingly and fixes the issue of deleting one file to many, as reported by Mark

Additionally, I noticed that ChainedFilterTest from contrib/miscellaneous would fail on machines with non-US locale, so this patch fixes that too.


Stanislaw Osinski made changes - 27/Jul/07 08:05 AM
Attachment jflex-analyzer-r560135-patch.txt [ 12362672 ]
Stanislaw Osinski added a comment - 31/Jul/07 10:05 AM
Here is another (this time let's call it "final" patch for this issue which comes with a simplified (but test-wise equivalent) grammar for numeric expressions that gives about 5% performance increase.

Stanislaw Osinski made changes - 31/Jul/07 10:05 AM
Attachment jflex-analyzer-r561292-patch.txt [ 12362865 ]
Michael McCandless added a comment - 31/Jul/07 05:12 PM
I took the patch from here (to use jflex for StandardAnalyzer) and
merged it with the patch from LUCENE-969 (re-use Token & TokenStream)
to measure the net performance gains.

I measure the time to just tokenize all of Wikipedia using
StandardAnalyzer using contrib/benchmark plus patch from LUCENE-967
(test details are described in LUCENE-969).

With the jflex patch it takes 646 sec (best of 2 runs); when I then
merge in the patch from LUCENE-969 it takes 455 sec. Subtracting off
the time to just load all Wikipedia docs (= 112 sec) that gives net
additional speedup of 36% (534 sec -> 343 sec) when using LUCENE-969
in addition to jflex.

A couple other things I noticed:

  • The init cost of jflex (StandardTokenizerImpl) seems to be fairly
    high: when I repeat the above test with smallish docs (100 tokens
    each) instead, the gain is around 84%. I think this just makes
    the new reusableTokenStream() in LUCENE-969 important to commit.
  • I'm seeing differing token counts with the jflex StandardAnalyzer
    vs the current one; I think there is some difference here. I will
    track down which tokens differ and post back...

Michael McCandless added a comment - 31/Jul/07 05:36 PM
I tracked down at least some differences between the JavaCC vs JFlex
versions of StandardAnalyzer.

I think we should resolve these before committing.

I just printed all tokens for the first 20 Wikipedia docs and diff'd
the outputs.

Here are the categories of differences that I saw:

  • Only the type differs on a filename-like token:

OLD: (2004.jpg,34461,34469,type=<HOST>)
NEW: (2004.jpg,34461,34469,type=<NUM>)

In this case the old StandardAnalyzer called "2004.jpg" a HOST and
the new one calls it a NUM. Seems like neither one is right!

  • Only the type differs on a number token:

OLD: (62.46,37004,37009,type=<HOST>)
NEW: (62.46,37004,37009,type=<NUM>)

The new tokenizer looks right here. I guess the decimal point
confuses the JavaCC (old) one.

  • Different number of tokens produced for number-like-token:

OLD: (978-0-94045043-1,86408,86424,type=<NUM>)
NEW: (978-0-94045043,86408,86422,type=<NUM>)
(1,86423,86424,type=<ALPHANUM>)

The new one split off the final "-1" as its own token, and called
it ALPHANUM not NUM. I think the old behavior is correct.

  • Different number of tokens produced for filename:

OLD: (78academyawards/rules/rule02.html,7194,7227,type=<NUM>)
NEW: (78academyawards/rules/rule02,7194,7222,type=<NUM>)
(html,7223,7227,type=<ALPHANUM>)

I think the old one is better, though it should not be called a
NUM (maybe we need a new "FILENAME" token type?).

  • Same as above, but split on final '_' instead of '.' ('-' also
    shows this behavior):

OLD: (2006-03-11t082958z_01_ban130523_rtridst_0_ozabs,2076,2123,type=<NUM>)
new: (2006-03-11t082958z_01_ban130523_rtridst_0,2076,2117,type=<NUM>)
(ozabs,2118,2123,type=<ALPHANUM>)


Stanislaw Osinski added a comment - 31/Jul/07 06:14 PM
Thanks for spotting the differences, I'll add them to the unit tests and will correct the tokenizer accordingly.

One doubt I have is about the filename-like tokens, e.g.:

OLD: (2004.jpg,34461,34469,type=<HOST>)
NEW: (2004.jpg,34461,34469,type=<NUM>)

To be honest, both variants seem "almost" correct. If you try 2007.org – this is a correct domain name (and has a funny website on it , so, given the fact that we don't check for typical suffixes, such as ".com", <HOST> doesn't seem wrong. On the other hand, 2004.jpg may well have been some sort of numerical code or a product number, so <NUM> is not totally irrelevant either.

For the JFlex-based tokenizer, I put the <NUM> rule matching first, as it gives some nice performance benefits. We can put HOST first, and then we'll get compliance with the old version.

Another option we might consider is:

  • adding a new token type for file names (get a list of common extensions or even assume that an extension is simply three alphanumerical characters)
  • checking for common domain names in hosts (something along the lines of: "mil" | "info" | "gov" | "edu" | "biz" | "com" | "org" | "net" | "arpa" | {LETTER}{2})

I'm not sure how this will affect the performance of the tokenizer, but my rough guess is that if we don't come up with very complex/ backtracking-prone rules there should not be too much of a difference. On the other hand, if 100% compatibility with the old tokenizer is a priority, adding new token types is not a good idea, I guess.

Finally, when it comes to the initialization time of the new tokenizer – according to the JFlex documentation, some time is required to unpack the transition tables. But the unpacking takes place during the initialization of static fields, so once the class is loaded the overhead should be negligible.


Mark Miller added a comment - 31/Jul/07 06:36 PM
I think it is important that we do not modify the StandardAnalyzer and that this improved version has results identical to the original. I believe that the StandardAnalyzer is pretty heavily used – these performance increases could give many users huge gains. By changing this Analyzer at all we will be forcing users to effectively lose terms in their indexes if they want to upgrade core Lucene for use on an older index.

I think any enhanced StandardAnalzyer should probably be released as a new Analyzer. This one should mimic the old though, as the performance gains will be very beneficial to a lot of current users.

  • Mark

Michael McCandless added a comment - 31/Jul/07 07:03 PM

I agree, let's try to perfectly match the tokens of the old
StandardAnalyzer so we have a way-faster drop-in replacement.

The speedups of JFlex are amazing: based on a quick test, with JFlex +
patch from LUCENE-969, the new StandardAnalyzer is only 2.09X slower
than WhitespaceAnalyzer even though it's doing so much more ...

> Finally, when it comes to the initialization time of the new
> tokenizer – according to the JFlex documentation, some time is
> required to unpack the transition tables. But the unpacking takes
> place during the initialization of static fields, so once the class
> is loaded the overhead should be negligible.

Yeah I'm baffled why it's that much slower, but on 100 token docs I
definitely see LUCENE-969 making things 84% faster but "only" 36%
faster if I use the full Wikipedia doc (which are much larger than 100
tokens on average). If we tested even smaller docs I think the gains
would be even more.

When I ran under the profiler it was the StandardTokenizerImpl
<init>(java.io.Reader) way on the top. Maybe it's the cost of new'ing
the 16 KB buffer each time?

In any event I think it's OK, so long as we get LUCENE-969 in, and
document the importance of using reusableTokenStream() API for better
performance.


Doug Cutting added a comment - 31/Jul/07 07:03 PM
It is important that the same sequence of token text is produced, but I think we could live with different token types in some cases, if we must. Few applications depend on token types, no?

Provided the token text issues can be resolved, I'd like to see StandardTokenizer replaced with this. Performance is important, and ideally folks shouldn't have to change their applications to see performance improvements.


Stanislaw Osinski added a comment - 01/Aug/07 07:53 AM
When digging deeper into the issues of compatibility with the original StandardAnalyzer, I stumbled upon something strange. Take the following text:

78academyawards/rules/rule02.html,7194,7227,type

which was tokenized by the original StandardAnalyzer as one <NUM>. If you look at the definition of the <NUM> token:

// every other segment must have at least one digit
<NUM: (<ALPHANUM> <P> <HAS_DIGIT>

<HAS_DIGIT> <P> <ALPHANUM>
<ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
<HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
<ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
<HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
)

you'll see that, as explained in the comment, every other segment must have at least one digit. But actually, according to my understanding, this rule should not match the above text as a whole (and with JFlex it doesn't , actually). Below is the text split by punctuation characters, and it looks like there is no way of splitting this text into alternating segments, every second of which must have a digit (A = ALPHANUM, H = HAS_DIGIT):

78academyawards / rules / rule02 . html , 7194 , 7227 , type
H P A P H P A P H P A P H?* (starting from the beginning)
H?* P A P H P A (starting from the end)

  • (would have to be H, but no digits in substring "type" or "html")

I have no idea why JavaCC matched the whole text as a <NUM>, JFlex behaved "more correctly" here.

Now I can see two solutions:

  • try to patch the JFlex grammar to emulate JavaCC quirks (though I may not be aware of most of them...)
  • relax the <NUM> rule a little bit (JFlex notation):

// there must be at least one segment with a digit
NUM = ({P} ({HAS_DIGIT} | {ALPHANUM}))* {HAS_DIGIT} ({P} ({HAS_DIGIT} | {ALPHANUM}))*

With this definition, again, all StandardAnalyzer tests pass, plus all texts along the lines of:

2006-03-11t082958z_01_ban130523_rtridst_0_ozabs,2076,2123,type
78academyawards/rules/rule02.html,7194,7227,type
978-0-94045043-1,86408,86424,type
62.46,37004,37009,type (this one was parsed as <HOST> by the original analyzer)

get parsed as a whole as one <NUM>, which is equivalent to what JavaCC-based version would do. I will attach a corresponding patch in a second.


Stanislaw Osinski added a comment - 01/Aug/07 08:33 AM
A patch for better compatibility with the StandardAnalyzer containing:
  • relaxed definition of the <NUM> token
  • new test cases in TestStandardAnalyzer

I noticed that with this patch org.apache.lucene.benchmark.quality.TestQualityRun.testTrecQuality fails, but I'm not sure if this is related to the tokenizer.


Stanislaw Osinski made changes - 01/Aug/07 08:33 AM
Michael McCandless added a comment - 01/Aug/07 03:50 PM
Oddly, the patch for TestStandardAnalyzer failed to apply for me (but
the rest did), so I manually merged those changes in. Oh, I see: it
was the "Korean words" test – somehow the characters got mapped to
?'s in your patch. This is why the patch didn't apply, I think?
Maybe you used a diffing tool that wasn't happy with unicode or
something?

I also see the quality test failing in contrib benchmark. I fear
something about the new StandardAnalyzer is in fact causing this test
to fail (it passes on a clean checkout). That test uses
StandardAnalyzer.

KO I re-tested the old vs new StandardAnalyzer on Wikipedia and I
still found some differences, I think only on these very large
URL-like tokens. Here's one:

OLD
(money.cnn.com,1382,1395,type=<HOST>)
(magazines,1396,1405,type=<ALPHANUM>)
(fortune,1406,1413,type=<ALPHANUM>)
(fortune,1414,1421,type=<ALPHANUM>)
(archive/2007/03/19/8402357,1422,1448,type=<NUM>)
(index.htm,1449,1458,type=<HOST>)

NEW
(/money.cnn.com/magazines/fortune/fortune_archive/2007/03/19/8402357/index.htm,1381,1458,type=<NUM>)

I like the NEW behavior better but I fear we should try to match the
old one?

Here's another one:

OLD
(mid-20th,2436,2444,type=<NUM>)

NEW
(mid,2436,2439,type=<ALPHANUM>)
(-20th,2439,2444,type=<NUM>)

I like the old behavior better here.

Another one:

OLD
(safari-0-sheikh,12011,12026,type=<NUM>)
(zayed,12027,12032,type=<ALPHANUM>)
(grand,12033,12038,type=<ALPHANUM>)
(mosque.jpg,12039,12049,type=<HOST>)

NEW
(safari,12011,12017,type=<ALPHANUM>)
(0-sheikh-zayed-grand-mosque.jpg,12018,12049,type=<NUM>)

Another one:

OLD
(semitica-01.png,616,631,type=<NUM>)

NEW
(-semitica-01.png,615,631,type=<NUM>)


Doron Cohen added a comment - 01/Aug/07 10:23 PM
The search quality test failure can be caused by the standard
analyzer generating different tokens than before. (has nothing
to do with token types.)

This is because the test's topics (queries) and qrels (expected matches)
were created by examining an index that was created using the current
standard analyzer. Now, running this test with an analyzer that creates
other tokens is likely to fail.

It is not difficult to update this test for a modified analyzer, but it seems
better to me to preserve the original standard analyzer behavior.


Stanislaw Osinski added a comment - 02/Aug/07 08:44 AM
Thanks for more test cases. I guess the biggest problem here is that the scanner generated by JavaCC doesn't seem to strictly follow the specification (see https://issues.apache.org/jira/browse/LUCENE-966#action_12516893), so I'd need to emulate possible JavaCC "bugs" I'm not aware of at the moment (I'm not an expert on lexical scanner generation either, not yet at least . I can add some workarounds to the grammar to make the known incompatibility examples work, but this won't guarantee consistency in general.

As a side note, it's a shame there's no trace of the version of JavaCC that was used to generate the scanner for the original StandardAnalyzer. I'm also curious if the results of the current JavaCC grammar would be the same with the newest version of the generator (4.0 I guess) – I'll try to check that.

Anyway, I'll take a look at the problem in more depth once again. And in the worst case scenario, we can keep the StandardAnalyzer as it was and add the new one next to it so that people can have a choice (on the other hand, this might be a problem for the quality tests).


Michael McCandless added a comment - 02/Aug/07 11:56 AM
If it really is down to emulating the bugs/oddities in JavaCC then I
think it's not worth polluting the new tokenizer with these legacy
bugs, unless one or two cases can match perfectly and not degrade
performance too badly?

And maybe what we should do is make this a new tokenizer, calling it
StandardAnalyzer2, and then deprecate the existing StandardAnalyzer?
Then remove any & all JavaCC bug emulation from the new one.

This way people relying on the precise bugs in JavaCC tokenization are
not hurt on upgrading to 2.3 and are given a chance to migrate to the
new one (with 1 release of deprecated StandardAnalyzer). And new
people will use the new faster one.


Stanislaw Osinski added a comment - 02/Aug/07 12:09 PM
To be precise – I'm not 100% sure that this is a bug in JavaCC (I'll try to browse/ask on their mailing list to find out), but it looks like the scanner generated by JavaCC does not really strictly follow the grammar. I discovered this when you gave the examples of different token produced by JFlex- and JavaCC-based scanners generated from equivalent grammars – JFlex seemed to behave "correctly", hence different tokens.

I was about to start trying to hack the grammar to produce results simiar to StandardAnalyzer (based on a finite set of test cases . I'll see how it compares performance-wise too.


Grant Ingersoll added a comment - 02/Aug/07 06:55 PM
+1 for deprecating StandardAnalyzer, although it is a little weird to have a StandardAnalyzer2, which is it the standard or not? Maybe we could call it the DefaultAnalyzer or something like that. I never much cared for tacking on numbers on the end of class names.

I agree with Mike M's last comment and approach, though, if that proves to be the case for the differences.


Michael McCandless added a comment - 02/Aug/07 07:08 PM
I like the name DefaultAnalyzer.

Mark Miller added a comment - 02/Aug/07 07:32 PM
These issues seem odd.

Both JavaCC and Flex match with the same rules:
1. Longest match first
2. If match size is the same, use the first in the grammar

OLD
(money.cnn.com,1382,1395,type=<HOST>)
(magazines,1396,1405,type=<ALPHANUM>)
(fortune,1406,1413,type=<ALPHANUM>)
(fortune,1414,1421,type=<ALPHANUM>)
(archive/2007/03/19/8402357,1422,1448,type=<NUM>)
(index.htm,1449,1458,type=<HOST>)

NEW
(/money.cnn.com/magazines/fortune/fortune_archive/2007/03/19/8402357/index.htm,1381,1458,type=<NUM>)

The old is correct and the NEW should not match <NUM>. <NUM> should break on '/' and '.' and every other token from the break should have a digit for a NUM match to occur. This is not the case.

OLD
(mid-20th,2436,2444,type=<NUM>)

NEW
(mid,2436,2439,type=<ALPHANUM>)
(-20th,2439,2444,type=<NUM>)

Something is wrong with the NEW one. <NUM> is certainly a valid longer match.

OLD
(safari-0-sheikh,12011,12026,type=<NUM>)
(zayed,12027,12032,type=<ALPHANUM>)
(grand,12033,12038,type=<ALPHANUM>)
(mosque.jpg,12039,12049,type=<HOST>)

NEW
(safari,12011,12017,type=<ALPHANUM>)
(0-sheikh-zayed-grand-mosque.jpg,12018,12049,type=<NUM>)

Again, something seems wrong with the NEW. (safari-0-sheikh,12011,12026,type=<NUM>) is a correct and longer match than (safari,12011,12017,type=<ALPHANUM>)

It would be nice to have the source text for these comparisons.

Also, a hard vote against StandardAnalyzer2 <g> Default is arguable as well, as this wouldn't be the default analyzer you should use in many cases (don't like standard because of that either).

From the latest samples, I would say something is off with the NEW and OLD appears mostly correct.

  • Mark

Mark Miller added a comment - 02/Aug/07 08:02 PM
By the way...you can see one of the issues here:

OLD
(money.cnn.com,1382,1395,type=<HOST>)
(magazines,1396,1405,type=<ALPHANUM>)
(fortune,1406,1413,type=<ALPHANUM>)
(fortune,1414,1421,type=<ALPHANUM>)
(archive/2007/03/19/8402357,1422,1448,type=<NUM>)
(index.htm,1449,1458,type=<HOST>)

NEW
(/money.cnn.com/magazines/fortune/fortune_archive/2007/03/19/8402357/index.htm,1381,1458,type=<NUM>)

JavaCC StandardAnalyzer would never output a token that starts with a '/'. It would be cut off. The issues may be involved with how JFlex skips characters that are not part of a match compared to how JavaCC is doing it. Or perhaps the JFlex version is considering '/' and '.' to be ALPHANUM's.

  • Mark

Mark Miller added a comment - 02/Aug/07 08:42 PM
>When digging deeper into the issues of compatibility with the original StandardAnalyzer, I stumbled upon something strange. Take the following >text:
>
>78academyawards/rules/rule02.html,7194,7227,type
>
>which was tokenized by the original StandardAnalyzer as one <NUM>. If you look at the definition of the <NUM> token:
>
>// every other segment must have at least one digit
><NUM: (<ALPHANUM> <P> <HAS_DIGIT>
> | <HAS_DIGIT> <P> <ALPHANUM>
> | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
> | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
> | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
> | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
> )
>
>you'll see that, as explained in the comment, every other segment must have at least one digit. But actually, according to my understanding, this >rule should not match the above text as a whole (and with JFlex it doesn't , actually). Below is the text split by punctuation characters, and it looks >like there is no way of splitting this text into alternating segments, every second of which must have a digit (A = ALPHANUM, H = HAS_DIGIT):
>
>78academyawards / rules / rule02 . html , 7194 , 7227 , type
> H P A P H P A P H P A P H?* (starting from the beginning)
> H?* P A P H P A (starting from the end)
>
>* (would have to be H, but no digits in substring "type" or "html")
>
>I have no idea why JavaCC matched the whole text as a <NUM>, JFlex behaved "more correctly" here.

I think that JavaCC is correct here. Every other segment must have a digit: 78academyawards / rules / rule02 . html
It looks to me like every other segment does have a digit. First segment has a digit, second does not, third does, fourth does not. Matches NUM correctly.


Stanislaw Osinski added a comment - 02/Aug/07 09:08 PM
Okkkk – only now I realized I made a really silly mistake When using Mark's examples I somehow took the ",type" substring as part of the token image, which made the JavaCC tokenizer look "buggy"... Apologies for the confusion, tomorrow in the morning I'll correct my tests and will see what's happening.

One more important clarification – the tokenizer from the last patch (jflex-analyzer-r561693-compatibility.txt) has a completely different definition of the <NUM> token – it allows digits in any segment, hence the totally different results. If we want to be compatible with the StandardAnalyzer, we should forget about that patch.

Mark – have you tried the jflex-analyzer-r560135-patch.txt patch with your wikipedia diff test? That's the early one whose grammar was "dot for dot" translated from the original JavaCC spec – for further patches I did some "optimizations", which seem to have broken the compatibility...

Incidentally, what was the motivation for requiring the <NUM> token to have numbers only in every second segment and not in any segment?


Stanislaw Osinski added a comment - 03/Aug/07 08:19 AM
I've attached another patch with:
  • grammar translated "dot for dot" from JavaCC (no "optimizations" this time)
  • tests extended with Mark's latest compatibility examples

I hope this time the patch correctly applies to the test class (previously I used Eclipse to create the patch which may not have been a good idea). All unit tests pass, so I hope this time I finally got it right.


Stanislaw Osinski made changes - 03/Aug/07 08:19 AM
Attachment jflex-analyzer-r562378-patch.txt [ 12363108 ]
Michael McCandless added a comment - 03/Aug/07 12:28 PM
The TestStandardAnalyzer.java patch still fails to apply – looks like
the same thing as before (the unicode chars under "Korean words" were
replaced by ?'s somewhere along the way).

And, somehow, many of the files are dup'd (appear twice) in the
patch (eg StandardTokenizerImpl.jflex, StandardFilter.java, etc.).
I'm not sure whether first or 2nd one is the one to keep.


Michael McCandless added a comment - 03/Aug/07 12:33 PM
> The TestStandardAnalyzer.java patch still fails to apply – looks like
> the same thing as before (the unicode chars under "Korean words" were
> replaced by ?'s somewhere along the way).

WOOPS! This was my bad. I was clicking on the patch and then
copy/pasting the text out of the browser into a local file, and this
step (ugh) introduces the ?'s. If I instead right-click on the link
and have browser directly save the file then it works perfectly!

Sorry for the confusion.


Stanislaw Osinski added a comment - 03/Aug/07 12:34 PM
This time I used Tortoise, but it made things worse I'll create the patch the usual way and e-mail it directly to you to see if it's not JIRA's fault (I don't see the '?' characters on my local machine).

Stanislaw Osinski added a comment - 03/Aug/07 12:37 PM
One more try – this time without duplicated entries (no idea why Tortoise did this...).

Stanislaw Osinski made changes - 03/Aug/07 12:37 PM
Attachment jflex-analyzer-r562378-patch-nodup.txt [ 12363116 ]
Michael McCandless added a comment - 03/Aug/07 12:43 PM
Super, I will compare tokens on Wikipedia...

Michael McCandless added a comment - 03/Aug/07 02:12 PM
First 1000 documents in wikipedia show absolutely no difference – I think this one is it!

Stanislaw Osinski added a comment - 03/Aug/07 04:51 PM
Good news – thanks for the test!

Michael McCandless added a comment - 03/Aug/07 04:55 PM
OK I can commit this. Does anyone see any reason not to? I will wait a few days then commit...

Thank you Stanislaw!


Michael McCandless added a comment - 08/Aug/07 03:01 PM

I was gearing up to commit this, but realized the copyright on top of
StandardTokenizer.java isn't the standard Apache header; it's this
one:

/*

  • Carrot2 project.
    *
  • Copyright (C) 2002-2007, Dawid Weiss, Stanisław Osiński.
  • Portions (C) Contributors listed in "carrot2.CONTRIBUTORS" file.
  • All rights reserved.
    *
  • Refer to the full license file "carrot2.LICENSE"
  • in the root folder of the repository checkout or at:
  • http://www.carrot2.org/carrot2.LICENSE
    */

I believe (according to http://www.apache.org/legal/src-headers.html),
we need to replace the above with the standard Apache header and then
maybe move this one into the NOTICE.txt. Is that right? Stanislaw,
is that OK?

I would also add the Apache header into StandardAnalyzerImpl.jflex.


Stanislaw Osinski added a comment - 08/Aug/07 05:50 PM
Absolutely – the header was there only because I used Carrot2's grammar to start with and forgot to remove it. You can put the Apache header on all the files, NOTICE.txt is not necessary either. Apologies for confusion

Michael McCandless added a comment - 08/Aug/07 05:54 PM
Super, thanks!

Repository Revision Date User Message
ASF #564036 Wed Aug 08 22:26:44 UTC 2007 mikemccand LUCENE-966: sizable (~6X faster) speedups to StandardTokenizer by using JFlex instead of JavaCC
Files Changed
ADD /lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex
MODIFY /lucene/java/trunk/contrib/miscellaneous/src/test/org/apache/lucene/misc/ChainedFilterTest.java
MODIFY /lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/package.html
DEL /lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.jj
ADD /lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.java
MODIFY /lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java
DEL /lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/Token.java
DEL /lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/ParseException.java
MODIFY /lucene/java/trunk/src/test/org/apache/lucene/analysis/TestStandardAnalyzer.java
DEL /lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/CharStream.java
MODIFY /lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardFilter.java
DEL /lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerConstants.java
MODIFY /lucene/java/trunk/build.xml
DEL /lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/FastCharStream.java
DEL /lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerTokenManager.java
DEL /lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/TokenMgrError.java
MODIFY /lucene/java/trunk/CHANGES.txt
MODIFY /lucene/java/trunk/common-build.xml

Michael McCandless added a comment - 08/Aug/07 10:29 PM
OK I committed this! Thank you Stanislaw!

I ran a quick perf test on Wikipedia (first 50K docs only) and found
the new StandardTokenizer is ~6X faster – awesome

I made these small additional changes over the final patch before
committing:

  • I removed StandardAnalyzer.html "grammar doc" generation from
    build.xml since it was using jjdoc. Stanislaw, is there something
    in jflex that can generated a BNF description of the grammar as
    HTML?
  • I removed the @author tag from StandardTokenizer.java: we are
    removing all such tags and instead giving credit in CHANGES.txt.
  • I removed the whitespace-only diffs from common-build.xml &
    build.xml.
  • I put back the big comment that describes this tokenizer in
    StandardTokenizer.java.
  • Put standard Apache copyright headers in all sources.

Michael McCandless made changes - 08/Aug/07 10:29 PM
Resolution Fixed [ 1 ]
Lucene Fields [New] [New, Patch Available]
Status Open [ 1 ] Resolved [ 5 ]
Stanislaw Osinski added a comment - 09/Aug/07 06:53 AM
> * I removed StandardAnalyzer.html "grammar doc" generation from
> build.xml since it was using jjdoc. Stanislaw, is there something
> in jflex that can generated a BNF description of the grammar as
> HTML?

I've had a look at the JFlex docs and there doesn't seem to be such a tool for JFlex, I'm afraid.


Michael McCandless added a comment - 09/Aug/07 08:21 AM
> > * I removed StandardAnalyzer.html "grammar doc" generation from
> > build.xml since it was using jjdoc. Stanislaw, is there something
> > > in jflex that can generated a BNF description of the grammar as
> > > HTML?

> I've had a look at the JFlex docs and there doesn't seem to be such
> a tool for JFlex, I'm afraid.

OK, I think we can just live without this. Thanks!


Michael Busch made changes - 25/Jan/08 03:24 AM
Status Resolved [ 5 ] Closed [ 6 ]
Repository Revision Date User Message
ASF #616248 Tue Jan 29 10:51:44 UTC 2008 mikemccand LUCENE-1150: make StandardAnalyzer tokenizer constants public again (public access was accidentally removed with LUCENE-966)
Files Changed
MODIFY /lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex
MODIFY /lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java
MODIFY /lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.java
MODIFY /lucene/java/trunk/contrib/wikipedia/src/java/org/apache/lucene/wikipedia/analysis/WikipediaTokenizer.java
MODIFY /lucene/java/trunk/contrib/wikipedia/src/java/org/apache/lucene/wikipedia/analysis/WikipediaTokenizerImpl.java
MODIFY /lucene/java/trunk/src/test/org/apache/lucene/analysis/TestAnalyzers.java
MODIFY /lucene/java/trunk/CHANGES.txt
MODIFY /lucene/java/trunk/src/java/org/apache/lucene/store/FSDirectory.java
MODIFY /lucene/java/trunk/contrib/wikipedia/src/java/org/apache/lucene/wikipedia/analysis/WikipediaTokenizerImpl.jflex