Details

    • Type: Task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.10, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New, Patch Available

      Description

      JFlex 1.6, to be released shortly, will have direct support for supplementary code points - JFlex 1.5 and earlier only support code points in the BMP.

      We can drop the use of ICU4J to generate surrogate pairs to extend our JFlex scanner specifications to handle supplementary code points.
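
      For background: a supplementary code point is represented in Java's UTF-16 char[] text as a surrogate pair, which is what the ICU4J-based generation step produced character classes for. A minimal sketch in plain Java (the code point chosen here is arbitrary):

          int cp = 0x1F600;                      // a supplementary code point (outside the BMP)
          char hi = Character.highSurrogate(cp); // 0xD83D
          char lo = Character.lowSurrogate(cp);  // 0xDE00
          assert Character.toCodePoint(hi, lo) == cp;
          // JFlex 1.5 rules can only match the individual chars hi and lo;
          // JFlex 1.6 can match the code point cp directly.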

       Attachments

      1. LUCENE-5770.patch
        5.18 MB
        Steve Rowe
      2. LUCENE-5770.patch
        11 kB
        Steve Rowe

        Activity

        Steve Rowe added a comment -

        Preliminary patch modifying the specifications for StandardTokenizer, UAX29URLEmailTokenizer, and HTMLStripCharFilter, the three JFlex specifications that use ICU4J-generated surrogate pairs to handle supplementary code points.

        When I manually generate the scanners for these using JFlex 1.6.0-SNAPSHOT, some tests fail; I haven't looked into them yet:

           [junit4] Tests with failures:
           [junit4]   - org.apache.lucene.analysis.charfilter.HTMLStripCharFilterTest.testSupplementaryCharsInTags
           [junit4]   - org.apache.lucene.analysis.charfilter.HTMLStripCharFilterTest.testRandomHugeStrings
           [junit4]   - org.apache.lucene.analysis.charfilter.HTMLStripCharFilterTest.testRandom
           [junit4]   - org.apache.lucene.analysis.core.TestRandomChains.testRandomChainsWithLargeStrings
           [junit4]   - org.apache.lucene.analysis.core.TestFactories.test
           [junit4]   - org.apache.lucene.analysis.core.TestRandomChains.testRandomChain
        

        The Standard and UAX29URLEmail tests are all passing, including WordBreakTestUnicode_6_3_0, which is built from the Unicode test data for the UAX#29 word break rules, and includes some tests with supplementary code points.

        Steve Rowe added a comment -

        To assess the relative performance of the modified StandardTokenizerImpl, I ran luceneutil's TestAnalyzerPerf (historical results for the 4.x version are shown here: http://people.apache.org/~mikemccand/lucenebench/analyzers.html).

        Here are the raw results of ten runs (after a run to populate the OS filesystem cache) on Linux with Oracle 1.7.0_60, against unmodified trunk, using enwiki-20130102-lines.txt:

        Standard time=48581.34 msec hash=-16468587987622665 tokens=203498795
        Standard time=48103.02 msec hash=-16468587987622665 tokens=203498795
        Standard time=44514.19 msec hash=-16468587987622665 tokens=203498795
        Standard time=48997.35 msec hash=-16468587987622665 tokens=203498795
        Standard time=47794.26 msec hash=-16468587987622665 tokens=203498795
        Standard time=48973.45 msec hash=-16468587987622665 tokens=203498795
        Standard time=52409.88 msec hash=-16468587987622665 tokens=203498795
        Standard time=49674.48 msec hash=-16468587987622665 tokens=203498795
        Standard time=48257.42 msec hash=-16468587987622665 tokens=203498795
        Standard time=48075.62 msec hash=-16468587987622665 tokens=203498795
            Mean time=48538.10 msec
        Mean toks/sec=4,192,557
        

        and the patched results:

        Standard time=49561.77 msec hash=-16468594357435165 tokens=203498791
        Standard time=49465.50 msec hash=-16468594357435165 tokens=203498791
        Standard time=50194.16 msec hash=-16468594357435165 tokens=203498791
        Standard time=48548.19 msec hash=-16468594357435165 tokens=203498791
        Standard time=49449.01 msec hash=-16468594357435165 tokens=203498791
        Standard time=52377.06 msec hash=-16468594357435165 tokens=203498791
        Standard time=52433.60 msec hash=-16468594357435165 tokens=203498791
        Standard time=50495.17 msec hash=-16468594357435165 tokens=203498791
        Standard time=46098.29 msec hash=-16468594357435165 tokens=203498791
        Standard time=48078.95 msec hash=-16468594357435165 tokens=203498791
            Mean time=49670.17 msec
        Mean toks/sec=4,097,002
        

        Comparing the mean throughput numbers, the patched version is ~2.3% slower.

        Comparing the highest throughput numbers, the patched version is ~3.5% slower.

        I believe the reason for the relative slowdown is the use of Java's codepoint APIs (Character.codePointAt(), .charCount(), etc.) over the input char[] buffer. I think this is an acceptable reduction in performance in exchange for the more easily maintainable single-source specifications.
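
        As a rough sketch of what that looks like (illustrative only, not the actual JFlex-generated scanner code), the scanner now advances through the buffer by code point rather than by char:

          // Counts the code points in buf[start..end), decoding surrogate pairs as it goes.
          static int countCodePoints(char[] buf, int start, int end) {
            int count = 0;
            for (int i = start; i < end; ) {
              int cp = Character.codePointAt(buf, i, end); // decodes a pair if buf[i] is a high surrogate
              i += Character.charCount(cp);                // 1 for BMP, 2 for supplementary
              count++;
            }
            return count;
          }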

        The number of tokens and the hash (calculated over the token text, positions, and offsets) differ slightly. I tracked this down to an unrelated change I made to the specification: I changed the ComplexContext rule, a specialization for Southeast Asian scripts, to include following WB:Format and/or WB:Extend characters, as most other rules in the specification do, following the UAX#29 WB4 rule. All tokenization differences are caused by the original specification triggering breaks at U+200C ZERO WIDTH NON-JOINER, which is a WB:Extend character, after and between Myanmar characters. When I reverted the changes to that rule in the patched version, it produced the same hash and number of tokens as the original unpatched version.
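
        For reference, the Word_Break property of U+200C can be checked with ICU4J; a minimal standalone sketch (class and constant names from the ICU4J API; verify against the ICU4J version on the classpath):

          import com.ibm.icu.lang.UCharacter;
          import com.ibm.icu.lang.UProperty;

          public class CheckZwnjWordBreak {
            public static void main(String[] args) {
              // Expect Word_Break=Extend for U+200C ZERO WIDTH NON-JOINER.
              int wb = UCharacter.getIntPropertyValue(0x200C, UProperty.WORD_BREAK);
              System.out.println(wb == UCharacter.WordBreak.EXTEND); // true
            }
          }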

        Steve Rowe added a comment -

        I was curious whether Java8 would reduce or eliminate the penalty for using the codepoint APIs, so I ran luceneutil's TestAnalyzerPerf under Oracle Java 1.8.0_05, again on Linux. The raw results against unpatched trunk:

        Standard time=27605.51 msec hash=-16468587987622665 tokens=203498795
        Standard time=27767.41 msec hash=-16468587987622665 tokens=203498795
        Standard time=29705.14 msec hash=-16468587987622665 tokens=203498795
        Standard time=30312.27 msec hash=-16468587987622665 tokens=203498795
        Standard time=28091.85 msec hash=-16468587987622665 tokens=203498795
        Standard time=29408.59 msec hash=-16468587987622665 tokens=203498795
        Standard time=28107.20 msec hash=-16468587987622665 tokens=203498795
        Standard time=28228.80 msec hash=-16468587987622665 tokens=203498795
        Standard time=28487.87 msec hash=-16468587987622665 tokens=203498795
        Standard time=31785.43 msec hash=-16468587987622665 tokens=203498795
            Mean time=28950.01 msec
        Mean toks/sec=7,029,316
        

        And against the patched version (I left the ComplexContext rule reverted, so the hashes and token counts match):

        Standard time=31967.65 msec hash=-16468587987622665 tokens=203498795
        Standard time=29123.18 msec hash=-16468587987622665 tokens=203498795
        Standard time=28408.14 msec hash=-16468587987622665 tokens=203498795
        Standard time=29412.19 msec hash=-16468587987622665 tokens=203498795
        Standard time=30255.32 msec hash=-16468587987622665 tokens=203498795
        Standard time=31915.55 msec hash=-16468587987622665 tokens=203498795
        Standard time=30301.20 msec hash=-16468587987622665 tokens=203498795
        Standard time=32921.60 msec hash=-16468587987622665 tokens=203498795
        Standard time=28528.48 msec hash=-16468587987622665 tokens=203498795
        Standard time=30649.49 msec hash=-16468587987622665 tokens=203498795
            Mean time=30348.28 msec
        Mean toks/sec=6,705,447
        

        Comparing the mean throughput numbers, the patched version is ~4.6% slower.

        Comparing the highest throughput numbers, the patched version is ~1.1% slower.

        But the big result here is that StandardAnalyzer is much faster under Java8 on Linux than under Java7: 68% better throughput on average for the unpatched version. I haven't run the benchmark on other platforms, but a throughput test over 20k Reuters docs with lucene/benchmark on Windows 7 under Oracle Java 1.8.0_05 was actually somewhat slower, so the Java8 speedup is clearly not universal.

        I did one run of TestAnalyzerPerf on Linux over all of the analyzers it benchmarks, instead of just over StandardAnalyzer, and each analyzer shows serious improvements on Java8 over Java7:

        Oracle 1.7.0_60:

        Standard time=48173.65 msec hash=-16468587987622665 tokens=203498795
        LowerCase time=42118.84 msec hash=-4828213998132430 tokens=184607939
        EdgeNGrams time=51357.61 msec hash=1432428577478099 tokens=504918366
        Shingles time=67035.01 msec hash=-21741319767311116 tokens=369115878
        WordDelimiterFilter time=50846.88 msec hash=-18262747001660775 tokens=219918096
        

        Oracle Java 1.8.0_05:

        (63% faster) Standard time=29627.88 msec hash=-16468587987622665 tokens=203498795
        (86% faster) LowerCase time=22692.98 msec hash=-4828213998132430 tokens=184607939
        (30% faster) EdgeNGrams time=39463.84 msec hash=1432428577478099 tokens=504918366
        (31% faster) Shingles time=51205.34 msec hash=-21741319767311116 tokens=369115878
        (44% faster) WordDelimiterFilter time=35398.86 msec hash=-18262749927564663 tokens=219918098
        
        Robert Muir added a comment -

        I'm not worried about it. We should also consider the data: Wikipedia is a little unusual in that it has an abnormally high proportion of these characters compared to most content.

        I tried to optimize the fast path in CharTokenizer/LowerCaseFilter etc. just as an experiment (because when you look at codePointAt()/charCount() you see a lot of checks), but when I ran it on data from blogs/tweets etc. it made no difference at all; I was also struggling with noise in Mike's benchmark.
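
        As a rough illustration of that kind of fast path (a sketch only, not the actual CharTokenizer/LowerCaseFilter code):

          // Fast path: most chars are not surrogates, so skip codePointAt()'s extra checks.
          static int nextCodePoint(char[] buf, int pos, int limit) {
            char c = buf[pos];
            if (!Character.isHighSurrogate(c)) {
              return c; // BMP char (or an unpaired low surrogate, returned as-is)
            }
            return Character.codePointAt(buf, pos, limit); // slow path: decode the surrogate pair
          }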

        Robert Muir added a comment -

        BTW, since I see you have this targeted at 4.10, I'll go make the 4.9 branch now.

        Steve Rowe added a comment -

        BTW, since I see you have this targeted at 4.10, I'll go make the 4.9 branch now.

        +1

        Steve Rowe added a comment - edited

        Full patch - JFlex 1.6 has been released, and is integrated into the build. All tests now pass.

        I'll commit shortly, once ant precommit finishes.

        ASF subversion and git services added a comment -

        Commit 1608134 from Steve Rowe in branch 'dev/trunk'
        [ https://svn.apache.org/r1608134 ]

        LUCENE-5770: Upgrade to JFlex 1.6, which has direct support for supplementary code points - as a result, ICU4J is no longer used to generate surrogate pairs to augment JFlex scanner specifications.

        ASF subversion and git services added a comment -

        Commit 1608313 from Steve Rowe in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1608313 ]

        LUCENE-5770: Upgrade to JFlex 1.6, which has direct support for supplementary code points - as a result, ICU4J is no longer used to generate surrogate pairs to augment JFlex scanner specifications. (merged trunk r1608134)

        Steve Rowe added a comment -

        Committed to trunk and branch_4x.

        On branch_4x, I set up the previous versions of StandardTokenizerImpl and UAX29URLEmailTokenizerImpl, along with ClassicTokenizer, to be generated using JFlex 1.5.1, so that the grammars don't have to be modified.


          People

          • Assignee: Steve Rowe
          • Reporter: Steve Rowe
          • Votes: 0
          • Watchers: 1
