Lucene - Core
LUCENE-966

A faster JFlex-based replacement for StandardAnalyzer

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.3
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      JFlex (http://www.jflex.de/) can be used to generate a replacement for StandardAnalyzer that is up to several times faster. Will add a patch and simple benchmark code in a while.

      1. AnalyzerBenchmark.java
        4 kB
        Stanislaw Osinski
      2. jflex-analyzer-patch.txt
        52 kB
        Stanislaw Osinski
      3. jflex-analyzer-r560135-patch.txt
        67 kB
        Stanislaw Osinski
      4. jflex-analyzer-r561292-patch.txt
        143 kB
        Stanislaw Osinski
      5. jflex-analyzer-r561693-compatibility.txt
        132 kB
        Stanislaw Osinski
      6. jflex-analyzer-r562378-patch.txt
        236 kB
        Stanislaw Osinski
      7. jflex-analyzer-r562378-patch-nodup.txt
        133 kB
        Stanislaw Osinski

        Activity

        Stanislaw Osinski added a comment -

        Here comes a somewhat rough patch (needing refactorings, see below) adding a JFlex-based replacement for StandardAnalyzer called FastAnalyzer. A few comments:

        • all tests from TestStandardAnalyzer pass
        • tokenizer generation added to Ant build scripts

        Because FastAnalyzer shares some code with StandardAnalyzer, it might be worthwhile to do some refactorings:

        • have a common test for both analyzers (only if we want to keep them in sync)
        • have a common superclass for FastFilter and StandardFilter
        • have a common superclass for FastAnalyzer and StandardAnalyzer
        Stanislaw Osinski added a comment -

        Here is a very simple benchmark I used to test the performance of StandardAnalyzer, FastAnalyzer and WhitespaceAnalyzer. I ran it on a number of JVMs and got the following results:

        Input: Reuters collection, the one used by contrib/benchmark, only documents
        longer than 100 bytes

        Machine: AMD Sempron 2600+, 2G RAM, Windows XP

        Sun 1.4.2 Server
        org.apache.lucene.analysis.standard.StandardAnalyzer: 15172 ms, 139667 tokens/s
        org.apache.lucene.analysis.fast.FastAnalyzer: 2438 ms, 869170 tokens/s
        org.apache.lucene.analysis.WhitespaceAnalyzer: 781 ms, 3547585 tokens/s

        Sun 1.4.2 Client
        org.apache.lucene.analysis.standard.StandardAnalyzer: 24187 ms, 87610 tokens/s
        org.apache.lucene.analysis.fast.FastAnalyzer: 3157 ms, 671218 tokens/s
        org.apache.lucene.analysis.WhitespaceAnalyzer: 1453 ms, 1906857 tokens/s

        Sun 1.5.0 Server
        org.apache.lucene.analysis.standard.StandardAnalyzer: 16062 ms, 131928 tokens/s
        org.apache.lucene.analysis.fast.FastAnalyzer: 2641 ms, 802361 tokens/s
        org.apache.lucene.analysis.WhitespaceAnalyzer: 750 ms, 3694218 tokens/s

        Sun 1.5.0 Client
        org.apache.lucene.analysis.standard.StandardAnalyzer: 23891 ms, 88696 tokens/s
        org.apache.lucene.analysis.fast.FastAnalyzer: 3641 ms, 581993 tokens/s
        org.apache.lucene.analysis.WhitespaceAnalyzer: 1437 ms, 1928089 tokens/s

        Sun 1.6.0 Server
        org.apache.lucene.analysis.standard.StandardAnalyzer: 13719 ms, 154460 tokens/s
        org.apache.lucene.analysis.fast.FastAnalyzer: 2484 ms, 853074 tokens/s
        org.apache.lucene.analysis.WhitespaceAnalyzer: 750 ms, 3694218 tokens/s

        Sun 1.6.0 Client
        org.apache.lucene.analysis.standard.StandardAnalyzer: 22312 ms, 94972 tokens/s
        org.apache.lucene.analysis.fast.FastAnalyzer: 2750 ms, 770558 tokens/s
        org.apache.lucene.analysis.WhitespaceAnalyzer: 1297 ms, 2136209 tokens/s

        IBM 1.4.2
        org.apache.lucene.analysis.standard.StandardAnalyzer: 11922 ms, 177741 tokens/s
        org.apache.lucene.analysis.fast.FastAnalyzer: 3218 ms, 658495 tokens/s
        org.apache.lucene.analysis.WhitespaceAnalyzer: 1407 ms, 1969199 tokens/s

        IBM 1.5.0
        org.apache.lucene.analysis.standard.StandardAnalyzer: 11797 ms, 179625 tokens/s
        org.apache.lucene.analysis.fast.FastAnalyzer: 2968 ms, 713961 tokens/s
        org.apache.lucene.analysis.WhitespaceAnalyzer: 1000 ms, 2770664 tokens/s

        BEA 1.4.2
        org.apache.lucene.analysis.standard.StandardAnalyzer: 16234 ms, 130530 tokens/s
        org.apache.lucene.analysis.fast.FastAnalyzer: 3344 ms, 633683 tokens/s
        org.apache.lucene.analysis.WhitespaceAnalyzer: 1343 ms, 2063040 tokens/s

        BEA 1.5.0 (looks really slow)
        org.apache.lucene.analysis.standard.StandardAnalyzer: 33891 ms, 62525 tokens/s
        org.apache.lucene.analysis.fast.FastAnalyzer: 12703 ms, 166813 tokens/s
        org.apache.lucene.analysis.WhitespaceAnalyzer: 4860 ms, 570095 tokens/s
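
        For reference, a minimal sketch of the timing loop such a benchmark might use (this is not the attached AnalyzerBenchmark.java; it assumes the Lucene 2.x-era TokenStream API, where next() returns a Token or null, and leaves the corpus loading out):

        import java.io.StringReader;
        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.TokenStream;

        public class AnalyzerBenchmarkSketch {
            // Tokenizes every document with the given analyzer and reports
            // wall-clock time and token throughput, as in the numbers above.
            static void benchmark(Analyzer analyzer, String[] docs) throws Exception {
                long tokens = 0;
                long start = System.currentTimeMillis();
                for (int i = 0; i < docs.length; i++) {
                    TokenStream ts = analyzer.tokenStream("text", new StringReader(docs[i]));
                    while (ts.next() != null) { // pre-2.9 API: next() returns a Token or null
                        tokens++;
                    }
                    ts.close();
                }
                long elapsed = System.currentTimeMillis() - start;
                System.out.println(analyzer.getClass().getName() + ": " + elapsed + " ms, "
                        + (tokens * 1000L / Math.max(1, elapsed)) + " tokens/s");
            }
        }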

        Yonik Seeley added a comment -

        Thanks Staszek, very nice!
        If FastTokenizer is compatible with StandardTokenizer, it should probably just replace it.
        Have you tried this replacement and running all the lucene tests?

        Mark Miller added a comment -

        Great patch! And a very quick turnaround to boot!

        I am seeing a HUGE speed increase!

        I also think this should certainly become the new StandardAnalyzer.

        I am running into only one issue...when rebuilding, after a clean, FastTokenizer does not seem to get generated – FastTokenizerImpl is there, but no FastTokenizer interface; it appears to be wiped out in the clean but not regenerated in the build. If I put it in manually, everything works great.

        • Mark
        Paul Smith added a comment -

        We did pretty much the same thing here at Aconex. The tokenization mechanism in the old JavaCC-based analyzer is woeful compared to what JFlex outputs.

        Nice work!

        Stanislaw Osinski added a comment -

        Here's another patch (against r560135) which:

        • replaces JavaCC-based StandardTokenizer with a JFlex-based one (all tests pass), as suggested by Yonik
        • updates build scripts accordingly and fixes the issue of deleting one file too many, as reported by Mark

        Additionally, I noticed that ChainedFilterTest from contrib/miscellaneous would fail on machines with non-US locale, so this patch fixes that too.

        Stanislaw Osinski added a comment -

        Here is another (this time let's call it "final") patch for this issue, which comes with a simplified (but test-wise equivalent) grammar for numeric expressions that gives about a 5% performance increase.

        Michael McCandless added a comment -

        I took the patch from here (to use jflex for StandardAnalyzer) and
        merged it with the patch from LUCENE-969 (re-use Token & TokenStream)
        to measure the net performance gains.

        I measure the time to just tokenize all of Wikipedia using
        StandardAnalyzer using contrib/benchmark plus patch from LUCENE-967
        (test details are described in LUCENE-969).

        With the jflex patch it takes 646 sec (best of 2 runs); when I then
        merge in the patch from LUCENE-969 it takes 455 sec. Subtracting off
        the time to just load all Wikipedia docs (= 112 sec) that gives net
        additional speedup of 36% (534 sec -> 343 sec) when using LUCENE-969
        in addition to jflex.

        A couple other things I noticed:

        • The init cost of jflex (StandardTokenizerImpl) seems to be fairly
          high: when I repeat the above test with smallish docs (100 tokens
          each) instead, the gain is around 84%. I think this just makes
          the new reusableTokenStream() in LUCENE-969 important to commit.
        • I'm seeing differing token counts with the jflex StandardAnalyzer
          vs the current one; I think there is some difference here. I will
          track down which tokens differ and post back...
        Michael McCandless added a comment -

        I tracked down at least some differences between the JavaCC vs JFlex
        versions of StandardAnalyzer.

        I think we should resolve these before committing.

        I just printed all tokens for the first 20 Wikipedia docs and diff'd
        the outputs.
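
        Such a dump takes only a few lines against the 2.x API; here is a hedged sketch (the Wikipedia document loading is omitted, and the output format mirrors the (text,start,end,type=<TYPE>) tuples quoted below):

        import java.io.IOException;
        import java.io.StringReader;
        import org.apache.lucene.analysis.Token;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.standard.StandardAnalyzer;

        public class DumpTokens {
            // Prints one (text,start,end,type) tuple per token so two analyzer
            // versions can be compared with a plain diff.
            static void dump(String text) throws IOException {
                TokenStream ts = new StandardAnalyzer().tokenStream("f", new StringReader(text));
                for (Token t = ts.next(); t != null; t = ts.next()) {
                    System.out.println("(" + t.termText() + "," + t.startOffset() + ","
                            + t.endOffset() + ",type=" + t.type() + ")");
                }
                ts.close();
            }
        }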

        Here are the categories of differences that I saw:

        • Only the type differs on a filename-like token:

        OLD: (2004.jpg,34461,34469,type=<HOST>)
        NEW: (2004.jpg,34461,34469,type=<NUM>)

        In this case the old StandardAnalyzer called "2004.jpg" a HOST and
        the new one calls it a NUM. Seems like neither one is right!

        • Only the type differs on a number token:

        OLD: (62.46,37004,37009,type=<HOST>)
        NEW: (62.46,37004,37009,type=<NUM>)

        The new tokenizer looks right here. I guess the decimal point
        confuses the JavaCC (old) one.

        • Different number of tokens produced for number-like-token:

        OLD: (978-0-94045043-1,86408,86424,type=<NUM>)
        NEW: (978-0-94045043,86408,86422,type=<NUM>)
        (1,86423,86424,type=<ALPHANUM>)

        The new one split off the final "-1" as its own token, and called
        it ALPHANUM not NUM. I think the old behavior is correct.

        • Different number of tokens produced for filename:

        OLD: (78academyawards/rules/rule02.html,7194,7227,type=<NUM>)
        NEW: (78academyawards/rules/rule02,7194,7222,type=<NUM>)
        (html,7223,7227,type=<ALPHANUM>)

        I think the old one is better, though it should not be called a
        NUM (maybe we need a new "FILENAME" token type?).

        • Same as above, but split on final '_' instead of '.' ('-' also
          shows this behavior):

        OLD: (2006-03-11t082958z_01_ban130523_rtridst_0_ozabs,2076,2123,type=<NUM>)
        NEW: (2006-03-11t082958z_01_ban130523_rtridst_0,2076,2117,type=<NUM>)
        (ozabs,2118,2123,type=<ALPHANUM>)

        Stanislaw Osinski added a comment -

        Thanks for spotting the differences, I'll add them to the unit tests and will correct the tokenizer accordingly.

        One doubt I have is about the filename-like tokens, e.g.:

        OLD: (2004.jpg,34461,34469,type=<HOST>)
        NEW: (2004.jpg,34461,34469,type=<NUM>)

        To be honest, both variants seem "almost" correct. If you try 2007.org – this is a correct domain name (and has a funny website on it), so, given the fact that we don't check for typical suffixes, such as ".com", <HOST> doesn't seem wrong. On the other hand, 2004.jpg may well have been some sort of numerical code or a product number, so <NUM> is not totally irrelevant either.

        For the JFlex-based tokenizer, I put the <NUM> rule matching first, as it gives some nice performance benefits. We can put HOST first, and then we'll get compliance with the old version.

        Another option we might consider is:

        • adding a new token type for file names (get a list of common extensions or even assume that an extension is simply three alphanumerical characters)
        • checking for common domain names in hosts (something along the lines of: "mil" | "info" | "gov" | "edu" | "biz" | "com" | "org" | "net" | "arpa" | {LETTER}{2})

        I'm not sure how this will affect the performance of the tokenizer, but my rough guess is that if we don't come up with very complex/backtracking-prone rules there should not be too much of a difference. On the other hand, if 100% compatibility with the old tokenizer is a priority, adding new token types is not a good idea, I guess.

        Finally, when it comes to the initialization time of the new tokenizer – according to the JFlex documentation, some time is required to unpack the transition tables. But the unpacking takes place during the initialization of static fields, so once the class is loaded the overhead should be negligible.

        Mark Miller added a comment -

        I think it is important that we do not modify the StandardAnalyzer and that this improved version has results identical to the original. I believe that the StandardAnalyzer is pretty heavily used – these performance increases could give many users huge gains. By changing this Analyzer at all we will be forcing users to effectively lose terms in their indexes if they want to upgrade core Lucene for use on an older index.

        I think any enhanced StandardAnalyzer should probably be released as a new Analyzer. This one should mimic the old one, though, as the performance gains will be very beneficial to a lot of current users.

        • Mark
        Michael McCandless added a comment -

        I agree, let's try to perfectly match the tokens of the old
        StandardAnalyzer so we have a way-faster drop-in replacement.

        The speedups of JFlex are amazing: based on a quick test, with JFlex +
        patch from LUCENE-969, the new StandardAnalyzer is only 2.09X slower
        than WhitespaceAnalyzer even though it's doing so much more ...

        > Finally, when it comes to the initialization time of the new
        > tokenizer – according to the JFlex documentation, some time is
        > required to unpack the transition tables. But the unpacking takes
        > place during the initialization of static fields, so once the class
        > is loaded the overhead should be negligible.

        Yeah I'm baffled why it's that much slower, but on 100 token docs I
        definitely see LUCENE-969 making things 84% faster but "only" 36%
        faster if I use the full Wikipedia doc (which are much larger than 100
        tokens on average). If we tested even smaller docs I think the gains
        would be even more.

        When I ran under the profiler it was the StandardTokenizerImpl
        <init>(java.io.Reader) way on the top. Maybe it's the cost of new'ing
        the 16 KB buffer each time?

        In any event I think it's OK, so long as we get LUCENE-969 in, and
        document the importance of using reusableTokenStream() API for better
        performance.
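
        To illustrate the difference in usage (a sketch, assuming the reusableTokenStream() method that the LUCENE-969 patch adds to Analyzer): with plain tokenStream() a new StandardTokenizerImpl, and with it that 16 KB buffer, is allocated per document, while with reuse it is allocated once.

        import java.io.StringReader;
        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.standard.StandardAnalyzer;

        public class ReuseSketch {
            static void tokenizeAll(String[] docs) throws Exception {
                Analyzer analyzer = new StandardAnalyzer();

                // Without reuse: a fresh tokenizer (and buffer) per document.
                for (int i = 0; i < docs.length; i++) {
                    TokenStream ts = analyzer.tokenStream("text", new StringReader(docs[i]));
                    while (ts.next() != null) { /* consume */ }
                }

                // With LUCENE-969: the same tokenizer is reset against each new
                // reader, so the per-document init cost disappears.
                for (int i = 0; i < docs.length; i++) {
                    TokenStream ts = analyzer.reusableTokenStream("text", new StringReader(docs[i]));
                    while (ts.next() != null) { /* consume */ }
                }
            }
        }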

        Doug Cutting added a comment -

        It is important that the same sequence of token text is produced, but I think we could live with different token types in some cases, if we must. Few applications depend on token types, no?

        Provided the token text issues can be resolved, I'd like to see StandardTokenizer replaced with this. Performance is important, and ideally folks shouldn't have to change their applications to see performance improvements.

        Stanislaw Osinski added a comment -

        When digging deeper into the issues of compatibility with the original StandardAnalyzer, I stumbled upon something strange. Take the following text:

        78academyawards/rules/rule02.html,7194,7227,type

        which was tokenized by the original StandardAnalyzer as one <NUM>. If you look at the definition of the <NUM> token:

        // every other segment must have at least one digit
        <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
             | <HAS_DIGIT> <P> <ALPHANUM>
             | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
             | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
             | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
             | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
             )

        you'll see that, as explained in the comment, every other segment must have at least one digit. But actually, according to my understanding, this rule should not match the above text as a whole (and with JFlex it doesn't, actually). Below is the text split by punctuation characters, and it looks like there is no way of splitting this text into alternating segments, every second of which must have a digit (A = ALPHANUM, H = HAS_DIGIT):

        78academyawards / rules / rule02 . html , 7194 , 7227 , type
        H P A P H P A P H P A P H?* (starting from the beginning)
        H?* P A P H P A (starting from the end)

        * would have to be H, but there are no digits in the substrings "type" or "html"

        I have no idea why JavaCC matched the whole text as a <NUM>; JFlex behaved "more correctly" here.

        Now I can see two solutions:

        • try to patch the JFlex grammar to emulate JavaCC quirks (though I may not be aware of most of them...)
        • relax the <NUM> rule a little bit (JFlex notation):

        // there must be at least one segment with a digit
        NUM = ({P} ({HAS_DIGIT} | {ALPHANUM}))* {HAS_DIGIT} ({P} ({HAS_DIGIT} | {ALPHANUM}))*

        With this definition, again, all StandardAnalyzer tests pass, plus all texts along the lines of:

        2006-03-11t082958z_01_ban130523_rtridst_0_ozabs,2076,2123,type
        78academyawards/rules/rule02.html,7194,7227,type
        978-0-94045043-1,86408,86424,type
        62.46,37004,37009,type (this one was parsed as <HOST> by the original analyzer)

        get parsed as a whole as one <NUM>, which is equivalent to what JavaCC-based version would do. I will attach a corresponding patch in a second.
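
        A sketch of how such cases might be pinned down as unit tests (the helper below is a stand-in modeled on TestStandardAnalyzer's assertions, not the actual patch):

        import java.io.StringReader;
        import junit.framework.TestCase;
        import org.apache.lucene.analysis.Token;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.standard.StandardAnalyzer;

        public class TestNumCompatibilitySketch extends TestCase {
            // Asserts that the whole input comes back as exactly one token of the
            // expected type.
            void assertSingleToken(String input, String expectedType) throws Exception {
                TokenStream ts = new StandardAnalyzer().tokenStream("f", new StringReader(input));
                Token t = ts.next();
                assertNotNull(t);
                assertEquals(input, t.termText());
                assertEquals(expectedType, t.type());
                assertNull(ts.next()); // and nothing follows it
            }

            public void testNumCompatibility() throws Exception {
                // Inputs that should parse as a single <NUM>, matching the old tokenizer.
                assertSingleToken("978-0-94045043-1", "<NUM>");
                assertSingleToken("2006-03-11t082958z_01_ban130523_rtridst_0_ozabs", "<NUM>");
            }
        }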

        Stanislaw Osinski added a comment -

        A patch for better compatibility with the StandardAnalyzer containing:

        • relaxed definition of the <NUM> token
        • new test cases in TestStandardAnalyzer

        I noticed that with this patch org.apache.lucene.benchmark.quality.TestQualityRun.testTrecQuality fails, but I'm not sure if this is related to the tokenizer.

        Michael McCandless added a comment -

        Oddly, the patch for TestStandardAnalyzer failed to apply for me (but
        the rest did), so I manually merged those changes in. Oh, I see: it
        was the "Korean words" test – somehow the characters got mapped to
        ?'s in your patch. This is why the patch didn't apply, I think?
        Maybe you used a diffing tool that wasn't happy with unicode or
        something?

        I also see the quality test failing in contrib benchmark. I fear
        something about the new StandardAnalyzer is in fact causing this test
        to fail (it passes on a clean checkout). That test uses
        StandardAnalyzer.

        OK, I re-tested the old vs new StandardAnalyzer on Wikipedia and I
        still found some differences, I think only on these very large
        URL-like tokens. Here's one:

        OLD
        (money.cnn.com,1382,1395,type=<HOST>)
        (magazines,1396,1405,type=<ALPHANUM>)
        (fortune,1406,1413,type=<ALPHANUM>)
        (fortune,1414,1421,type=<ALPHANUM>)
        (archive/2007/03/19/8402357,1422,1448,type=<NUM>)
        (index.htm,1449,1458,type=<HOST>)

        NEW
        (/money.cnn.com/magazines/fortune/fortune_archive/2007/03/19/8402357/index.htm,1381,1458,type=<NUM>)

        I like the NEW behavior better but I fear we should try to match the
        old one?

        Here's another one:

        OLD
        (mid-20th,2436,2444,type=<NUM>)

        NEW
        (mid,2436,2439,type=<ALPHANUM>)
        (-20th,2439,2444,type=<NUM>)

        I like the old behavior better here.

        Another one:

        OLD
        (safari-0-sheikh,12011,12026,type=<NUM>)
        (zayed,12027,12032,type=<ALPHANUM>)
        (grand,12033,12038,type=<ALPHANUM>)
        (mosque.jpg,12039,12049,type=<HOST>)

        NEW
        (safari,12011,12017,type=<ALPHANUM>)
        (0-sheikh-zayed-grand-mosque.jpg,12018,12049,type=<NUM>)

        Another one:

        OLD
        (semitica-01.png,616,631,type=<NUM>)

        NEW
        (-semitica-01.png,615,631,type=<NUM>)

        Doron Cohen added a comment -

        The search quality test failure can be caused by the standard
        analyzer generating different tokens than before. (has nothing
        to do with token types.)

        This is because the test's topics (queries) and qrels (expected matches)
        were created by examining an index that was created using the current
        standard analyzer. Now, running this test with an analyzer that creates
        other tokens is likely to fail.

        It is not difficult to update this test for a modified analyzer, but it seems
        better to me to preserve the original standard analyzer behavior.

        Stanislaw Osinski added a comment -

        Thanks for more test cases. I guess the biggest problem here is that the scanner generated by JavaCC doesn't seem to strictly follow the specification (see https://issues.apache.org/jira/browse/LUCENE-966#action_12516893), so I'd need to emulate possible JavaCC "bugs" I'm not aware of at the moment (I'm not an expert on lexical scanner generation either, not yet at least). I can add some workarounds to the grammar to make the known incompatibility examples work, but this won't guarantee consistency in general.

        As a side note, it's a shame there's no trace of the version of JavaCC that was used to generate the scanner for the original StandardAnalyzer. I'm also curious if the results of the current JavaCC grammar would be the same with the newest version of the generator (4.0 I guess) – I'll try to check that.

        Anyway, I'll take a look at the problem in more depth once again. And in the worst case scenario, we can keep the StandardAnalyzer as it was and add the new one next to it so that people can have a choice (on the other hand, this might be a problem for the quality tests).

        Michael McCandless added a comment -

        If it really is down to emulating the bugs/oddities in JavaCC then I
        think it's not worth polluting the new tokenizer with these legacy
        bugs, unless one or two cases can match perfectly and not degrade
        performance too badly?

        And maybe what we should do is make this a new tokenizer, calling it
        StandardAnalyzer2, and then deprecate the existing StandardAnalyzer?
        Then remove any & all JavaCC bug emulation from the new one.

        This way people relying on the precise bugs in JavaCC tokenization are
        not hurt on upgrading to 2.3 and are given a chance to migrate to the
        new one (with 1 release of deprecated StandardAnalyzer). And new
        people will use the new faster one.

        Stanislaw Osinski added a comment -

        To be precise – I'm not 100% sure that this is a bug in JavaCC (I'll try to browse/ask on their mailing list to find out), but it looks like the scanner generated by JavaCC does not really strictly follow the grammar. I discovered this when you gave the examples of different tokens produced by JFlex- and JavaCC-based scanners generated from equivalent grammars – JFlex seemed to behave "correctly", hence different tokens.

        I was about to start trying to hack the grammar to produce results similar to StandardAnalyzer (based on a finite set of test cases). I'll see how it compares performance-wise too.

        Grant Ingersoll added a comment -

        +1 for deprecating StandardAnalyzer, although it is a little weird to have a StandardAnalyzer2 – is it the standard or not? Maybe we could call it DefaultAnalyzer or something like that. I never much cared for tacking numbers on the end of class names.

        I agree with Mike M's last comment and approach, though, if that proves to be the case for the differences.

        Michael McCandless added a comment -

        I like the name DefaultAnalyzer.

        Mark Miller added a comment -

        These issues seem odd.

        Both JavaCC and JFlex match with the same rules:
        1. Longest match first
        2. If match size is the same, use the first in the grammar

        OLD
        (money.cnn.com,1382,1395,type=<HOST>)
        (magazines,1396,1405,type=<ALPHANUM>)
        (fortune,1406,1413,type=<ALPHANUM>)
        (fortune,1414,1421,type=<ALPHANUM>)
        (archive/2007/03/19/8402357,1422,1448,type=<NUM>)
        (index.htm,1449,1458,type=<HOST>)

        NEW
        (/money.cnn.com/magazines/fortune/fortune_archive/2007/03/19/8402357/index.htm,1381,1458,type=<NUM>)

        The old is correct and the NEW should not match <NUM>. <NUM> should break on '/' and '.' and every other token from the break should have a digit for a NUM match to occur. This is not the case.

        OLD
        (mid-20th,2436,2444,type=<NUM>)

        NEW
        (mid,2436,2439,type=<ALPHANUM>)
        (-20th,2439,2444,type=<NUM>)

        Something is wrong with the NEW one. <NUM> is certainly a valid longer match.

        OLD
        (safari-0-sheikh,12011,12026,type=<NUM>)
        (zayed,12027,12032,type=<ALPHANUM>)
        (grand,12033,12038,type=<ALPHANUM>)
        (mosque.jpg,12039,12049,type=<HOST>)

        NEW
        (safari,12011,12017,type=<ALPHANUM>)
        (0-sheikh-zayed-grand-mosque.jpg,12018,12049,type=<NUM>)

        Again, something seems wrong with the NEW. (safari-0-sheikh,12011,12026,type=<NUM>) is a correct and longer match than (safari,12011,12017,type=<ALPHANUM>)

        It would be nice to have the source text for these comparisons.

        Also, a hard vote against StandardAnalyzer2 <g>. Default is arguable as well, as this wouldn't be the default analyzer you should use in many cases (I don't like "standard" because of that either).

        From the latest samples, I would say something is off with the NEW and OLD appears mostly correct.

        • Mark
        Mark Miller added a comment -

        By the way...you can see one of the issues here:

        OLD
        (money.cnn.com,1382,1395,type=<HOST>)
        (magazines,1396,1405,type=<ALPHANUM>)
        (fortune,1406,1413,type=<ALPHANUM>)
        (fortune,1414,1421,type=<ALPHANUM>)
        (archive/2007/03/19/8402357,1422,1448,type=<NUM>)
        (index.htm,1449,1458,type=<HOST>)

        NEW
        (/money.cnn.com/magazines/fortune/fortune_archive/2007/03/19/8402357/index.htm,1381,1458,type=<NUM>)

        JavaCC StandardAnalyzer would never output a token that starts with a '/'. It would be cut off. The issues may be involved with how JFlex skips characters that are not part of a match compared to how JavaCC is doing it. Or perhaps the JFlex version is considering '/' and '.' to be ALPHANUM's.

        • Mark
        Mark Miller added a comment -

        >When digging deeper into the issues of compatibility with the original StandardAnalyzer, I stumbled upon something strange. Take the following >text:
        >
        >78academyawards/rules/rule02.html,7194,7227,type
        >
        >which was tokenized by the original StandardAnalyzer as one <NUM>. If you look at the definition of the <NUM> token:
        >
        >// every other segment must have at least one digit
        ><NUM: (<ALPHANUM> <P> <HAS_DIGIT>
        > | <HAS_DIGIT> <P> <ALPHANUM>
        > | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
        > | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
        > | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
        > | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
        > )
        >
        >you'll see that, as explained in the comment, every other segment must have at least one digit. But actually, according to my understanding, this >rule should not match the above text as a whole (and with JFlex it doesn't , actually). Below is the text split by punctuation characters, and it looks >like there is no way of splitting this text into alternating segments, every second of which must have a digit (A = ALPHANUM, H = HAS_DIGIT):
        >
        >78academyawards / rules / rule02 . html , 7194 , 7227 , type
        > H P A P H P A P H P A P H?* (starting from the beginning)
        > H?* P A P H P A (starting from the end)
        >
        >* (would have to be H, but no digits in substring "type" or "html")
        >
        >I have no idea why JavaCC matched the whole text as a <NUM>, JFlex behaved "more correctly" here.

        I think that JavaCC is correct here. Every other segment must have a digit: 78academyawards / rules / rule02 . html
        It looks to me like every other segment does have a digit. First segment has a digit, second does not, third does, fourth does not. Matches NUM correctly.

        Stanislaw Osinski added a comment -

        Okkkk – only now I realized I made a really silly mistake. When using Mark's examples I somehow took the ",type" substring as part of the token image, which made the JavaCC tokenizer look "buggy"... Apologies for the confusion; tomorrow morning I'll correct my tests and see what's happening.

        One more important clarification – the tokenizer from the last patch (jflex-analyzer-r561693-compatibility.txt) has a completely different definition of the <NUM> token – it allows digits in any segment, hence the totally different results. If we want to be compatible with the StandardAnalyzer, we should forget about that patch.

        Mark – have you tried the jflex-analyzer-r560135-patch.txt patch with your Wikipedia diff test? That's the early one whose grammar was "dot for dot" translated from the original JavaCC spec – for further patches I did some "optimizations", which seem to have broken the compatibility...

        Incidentally, what was the motivation for requiring the <NUM> token to have numbers only in every second segment and not in any segment?

        Stanislaw Osinski added a comment -

        I've attached another patch with:

        • grammar translated "dot for dot" from JavaCC (no "optimizations" this time)
        • tests extended with Mark's latest compatibility examples

        I hope this time the patch correctly applies to the test class (previously I used Eclipse to create the patch which may not have been a good idea). All unit tests pass, so I hope this time I finally got it right.

        Michael McCandless added a comment -

        The TestStandardAnalyzer.java patch still fails to apply – looks like
        the same thing as before (the unicode chars under "Korean words" were
        replaced by ?'s somewhere along the way).

        And, somehow, many of the files are dup'd (appear twice) in the
        patch (eg StandardTokenizerImpl.jflex, StandardFilter.java, etc.).
        I'm not sure whether the first or the second one is the one to keep.

        Michael McCandless added a comment -

        > The TestStandardAnalyzer.java patch still fails to apply – looks like
        > the same thing as before (the unicode chars under "Korean words" were
        > replaced by ?'s somewhere along the way).

        WOOPS! This was my bad. I was clicking on the patch and then
        copy/pasting the text out of the browser into a local file, and that
        step (ugh) introduces the ?'s. If I instead right-click on the link
        and have the browser save the file directly, it works perfectly!

        Sorry for the confusion.

        Hide
        Stanislaw Osinski added a comment -

        This time I used Tortoise, but it made things worse. I'll create the patch the usual way and e-mail it directly to you to see whether it's JIRA's fault (I don't see the '?' characters on my local machine).

        Hide
        Stanislaw Osinski added a comment -

        One more try – this time without duplicated entries (no idea why Tortoise did this...).

        Hide
        Michael McCandless added a comment -

        Super, I will compare tokens on Wikipedia...

        Hide
        Michael McCandless added a comment -

        First 1000 documents in Wikipedia show absolutely no difference – I think this one is it!
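
        (For reference, a token-by-token comparison like this can be scripted
        along the following lines. This is a minimal sketch against the
        pre-2.9 TokenStream API, where next() returns a Token or null at the
        end of the stream; the class and field names are made up for
        illustration:)

        import java.io.StringReader;
        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.Token;
        import org.apache.lucene.analysis.TokenStream;

        // Compares the token output of two analyzers on the same text and
        // reports the first position at which they diverge.
        public class TokenDiff {
          public static boolean sameTokens(Analyzer a, Analyzer b, String text)
              throws Exception {
            TokenStream tsA = a.tokenStream("f", new StringReader(text));
            TokenStream tsB = b.tokenStream("f", new StringReader(text));
            for (int pos = 0; ; pos++) {
              Token ta = tsA.next();   // null once the stream is exhausted
              Token tb = tsB.next();
              if (ta == null || tb == null) return ta == tb;
              if (!ta.termText().equals(tb.termText())
                  || !ta.type().equals(tb.type())
                  || ta.startOffset() != tb.startOffset()
                  || ta.endOffset() != tb.endOffset()) {
                System.out.println("diff at token " + pos + ": "
                    + ta.termText() + "/" + ta.type() + " vs "
                    + tb.termText() + "/" + tb.type());
                return false;
              }
            }
          }
        }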

        Hide
        Stanislaw Osinski added a comment -

        Good news – thanks for the test!

        Hide
        Michael McCandless added a comment -

        OK, I can commit this. Does anyone see any reason not to? I will wait a few days, then commit...

        Thank you Stanislaw!

        Hide
        Michael McCandless added a comment -

        I was gearing up to commit this, but realized the copyright on top of
        StandardTokenizer.java isn't the standard Apache header; it's this
        one:

        /*
         * Carrot2 project.
         *
         * Copyright (C) 2002-2007, Dawid Weiss, Stanisław Osiński.
         * Portions (C) Contributors listed in "carrot2.CONTRIBUTORS" file.
         * All rights reserved.
         *
         * Refer to the full license file "carrot2.LICENSE"
         * in the root folder of the repository checkout or at:
         * http://www.carrot2.org/carrot2.LICENSE
         */

        I believe (according to http://www.apache.org/legal/src-headers.html),
        we need to replace the above with the standard Apache header and then
        maybe move this one into the NOTICE.txt. Is that right? Stanislaw,
        is that OK?

        I would also add the Apache header into StandardTokenizerImpl.jflex.
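
        (For reference, the standard ASF source header referred to above –
        the block that would replace the Carrot2 one – reads:)

        /**
         * Licensed to the Apache Software Foundation (ASF) under one or more
         * contributor license agreements.  See the NOTICE file distributed with
         * this work for additional information regarding copyright ownership.
         * The ASF licenses this file to You under the Apache License, Version 2.0
         * (the "License"); you may not use this file except in compliance with
         * the License.  You may obtain a copy of the License at
         *
         *     http://www.apache.org/licenses/LICENSE-2.0
         *
         * Unless required by applicable law or agreed to in writing, software
         * distributed under the License is distributed on an "AS IS" BASIS,
         * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
         * See the License for the specific language governing permissions and
         * limitations under the License.
         */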

        Hide
        Stanislaw Osinski added a comment -

        Absolutely – the header was there only because I used Carrot2's grammar as a starting point and forgot to remove it. You can put the Apache header on all the files; NOTICE.txt is not necessary either. Apologies for the confusion!

        Hide
        Michael McCandless added a comment -

        Super, thanks!

        Hide
        Michael McCandless added a comment -

        OK I committed this! Thank you Stanislaw!

        I ran a quick perf test on Wikipedia (first 50K docs only) and found
        the new StandardTokenizer is ~6X faster – awesome!

        I made these small additional changes over the final patch before
        committing:

        • I removed StandardAnalyzer.html "grammar doc" generation from
          build.xml since it was using jjdoc. Stanislaw, is there something
          in jflex that can generate a BNF description of the grammar as
          HTML?
        • I removed the @author tag from StandardTokenizer.java: we are
          removing all such tags and instead giving credit in CHANGES.txt.
        • I removed the whitespace-only diffs from common-build.xml &
          build.xml.
        • I put back the big comment that describes this tokenizer in
          StandardTokenizer.java.
        • I put standard Apache copyright headers in all sources.
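
        (For anyone wanting to sanity-check the committed tokenizer, here is a
        minimal usage sketch against the 2.3-era API; the field name and
        sample text are arbitrary:)

        import java.io.StringReader;
        import org.apache.lucene.analysis.Token;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.standard.StandardAnalyzer;

        // Prints each token and its type (<ALPHANUM>, <NUM>, <HOST>, ...)
        // produced by the new JFlex-based StandardTokenizer chain.
        public class PrintTokens {
          public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer();
            TokenStream ts = analyzer.tokenStream("body", new StringReader(
                "O'Reilly's IP is 216.239.63.104, order XY-1234"));
            // Pre-2.9 API: next() returns null once the stream is exhausted.
            for (Token t = ts.next(); t != null; t = ts.next()) {
              System.out.println(t.termText() + " [" + t.type() + "]");
            }
          }
        }
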
        Hide
        Stanislaw Osinski added a comment -

        > * I removed StandardAnalyzer.html "grammar doc" generation from
        > build.xml since it was using jjdoc. Stanislaw, is there something
        > in jflex that can generate a BNF description of the grammar as
        > HTML?

        I've had a look at the JFlex docs and there doesn't seem to be such a tool for JFlex, I'm afraid.

        Hide
        Michael McCandless added a comment -

        > > * I removed StandardAnalyzer.html "grammar doc" generation from
        > > build.xml since it was using jjdoc. Stanislaw, is there something
        > > in jflex that can generate a BNF description of the grammar as
        > > HTML?

        > I've had a look at the JFlex docs and there doesn't seem to be such
        > a tool for JFlex, I'm afraid.

        OK, I think we can just live without this. Thanks!


          People

          • Assignee:
            Unassigned
          • Reporter:
            Stanislaw Osinski
          • Votes:
            2
          • Watchers:
            1
