Lucene - Core

LUCENE-2084: remove Byte/CharBuffer wrapping for collation key generation

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: modules/other
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      We can remove the overhead of ByteBuffer and CharBuffer wrapping in CollationKeyFilter and ICUCollationKeyFilter.

      This patch moves the logic in IndexableBinaryStringTools into char[],int,int and byte[],int,int based methods, with the previous Byte/CharBuffer methods delegating to these.
      Previously, the Byte/CharBuffer methods required a backing array anyway.
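      Below is a minimal sketch of the delegation pattern the description refers to; the class name, signatures, and bodies are illustrative stand-ins, not the committed code.

      import java.nio.ByteBuffer;
      import java.nio.CharBuffer;

      // Illustrative sketch only: the primary implementation works on arrays, and the
      // old NIO entry point (which already required array-backed buffers) delegates to it.
      public final class BinaryStringEncodingSketch {

        public static void encode(byte[] input, int inputOffset, int inputLength,
                                  char[] output, int outputOffset, int outputLength) {
          // actual bit packing omitted in this sketch
        }

        @Deprecated
        public static void encode(ByteBuffer input, CharBuffer output) {
          // Preserves the class's documented convention: the whole wrapped array
          // region from arrayOffset() to limit() is processed.
          encode(input.array(), input.arrayOffset(), input.limit() - input.arrayOffset(),
                 output.array(), output.arrayOffset(), output.limit() - output.arrayOffset());
        }
      }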

      1. collation.benchmark.tar.bz2
        1.94 MB
        Steve Rowe
      2. LUCENE-2084.patch
        40 kB
        Robert Muir
      3. LUCENE-2084.patch
        40 kB
        Robert Muir
      4. LUCENE-2084.patch
        39 kB
        Robert Muir
      5. LUCENE-2084.patch
        39 kB
        Robert Muir
      6. LUCENE-2084.patch
        20 kB
        Steve Rowe
      7. LUCENE-2084.patch
        20 kB
        Robert Muir
      8. TopTFWikipediaWords.tar.bz2
        8 kB
        Steve Rowe

        Activity

        Steve Rowe added a comment -

        synched to current trunk, after the LUCENE-2124 move

        Steve Rowe added a comment -

        Attached collation.benchmark.tar.bz2, which contains everything needed to run an analysis-only contrib benchmark for the (ICU)CollationKeyAnalyzers over 4 languages: English, French, German, and Ukrainian.

        Included are:

        1. For each language, a line-doc containing the most frequent 100K words from a corresponding Wikipedia dump from November 2009;
        2. For each language, Java code for a no-argument analyzer callable from a benchmark alg, that specializes (ICU)CollationKeyAnalyzer and uses PerFieldAnalyzerWrapper to only run it over the line-doc body field
        3. A script to compile and jarify the above analyzers
        4. A benchmark alg running 5 iterations of 10 repetitions of analysis only over the line-doc for each language
        5. A script to find the minimum elapsed time for each combination, and output the results as a JIRA table
        6. A script to run the previous two scripts once for each of three JDK versions
        7. A script to compare the output of the above script before and after applying the attached patch removing Char/ByteBuffer wrapping, and output the result as a JIRA table
        Steve Rowe added a comment -

        To run the benchmark:

        1. Unpack collation.benchmark.tar.bz2 in a full Lucene-java tree into the contrib/benchmark/ directory. All contents will be put under a new directory named collation/.
        2. Compile and jarify the localized (ICU)CollationKeyAnalyzer code: from the collation/ directory, run the script build.test.analyzer.jar.sh.
        3. From an unpatched java/trunk/, build Lucene's jars: ant clean package.
        4. From the contrib/benchmark/ directory, run the collation benchmark: collation/run-benchmark.sh > unpatched.collation.bm.table.txt
        5. Apply the attached patch to the Lucene-java tree
        6. From java/trunk/, build Lucene's jars: ant clean package.
        7. From the contrib/benchmark/ directory, run the collation benchmark: collation/run-benchmark.sh > patched.collation.bm.table.txt
        8. Produce the comparison table: collation/compare.collation.benchmark.tables.pl unpatched.collation.bm.table.txt patched.collation.bm.table.txt > collation.diff.bm.table.txt
        Robert Muir added a comment -

        Steven, this looks cool. I'm trying to coerce my little netbook into running this thing now, but I took a glance already.

        Do you think after this issue is resolved (whether it helps or doesn't help/won't fix either way) that we should open a separate issue to work on committing the benchmark, so we have collation benchmarks for the future? Seems like it would be good to have.

        Steve Rowe added a comment -

        Here are the unpatched results I got - these look quite similar to the results I posted from a custom (non-contrib-benchmark) benchmark in the description of LUCENE-1719:

        Sun JVM Language java.text ICU4J KeywordAnalyzer ICU4J Improvement
        1.5.0_15 (32-bit) English 9.00s 4.89s 2.90s 207%
        1.5.0_15 (32-bit) French 10.64s 5.12s 2.95s 254%
        1.5.0_15 (32-bit) German 10.19s 5.19s 2.97s 225%
        1.5.0_15 (32-bit) Ukrainian 13.66s 7.20s 2.96s 152%
        Sun JVM Language java.text ICU4J KeywordAnalyzer ICU4J Improvement
        1.5.0_15 (64-bit) English 5.97s 2.55s 1.50s 326%
        1.5.0_15 (64-bit) French 6.86s 2.74s 1.56s 349%
        1.5.0_15 (64-bit) German 6.85s 2.76s 1.59s 350%
        1.5.0_15 (64-bit) Ukrainian 9.56s 4.01s 1.56s 227%
        Sun JVM Language java.text ICU4J KeywordAnalyzer ICU4J Improvement
        1.6.0_13 (64-bit) English 3.04s 2.06s 1.07s 99%
        1.6.0_13 (64-bit) French 3.58s 2.04s 1.14s 171%
        1.6.0_13 (64-bit) German 3.35s 2.22s 1.14s 105%
        1.6.0_13 (64-bit) Ukrainian 4.48s 2.94s 1.21s 89%

        Here are the results after applying the synced-to-trunk patch:

        Sun JVM Language java.text ICU4J KeywordAnalyzer ICU4J Improvement
        1.5.0_15 (32-bit) English 8.73s 4.61s 2.90s 241%
        1.5.0_15 (32-bit) French 10.38s 4.87s 2.94s 285%
        1.5.0_15 (32-bit) German 9.95s 4.94s 2.97s 254%
        1.5.0_15 (32-bit) Ukrainian 13.37s 6.91s 2.90s 161%
        Sun JVM Language java.text ICU4J KeywordAnalyzer ICU4J Improvement
        1.5.0_15 (64-bit) English 5.78s 2.65s 1.57s 290%
        1.5.0_15 (64-bit) French 6.74s 2.74s 1.64s 364%
        1.5.0_15 (64-bit) German 6.69s 2.86s 1.66s 319%
        1.5.0_15 (64-bit) Ukrainian 9.40s 4.18s 1.62s 204%
        Sun JVM Language java.text ICU4J KeywordAnalyzer ICU4J Improvement
        1.6.0_13 (64-bit) English 3.06s 1.82s 1.09s 170%
        1.6.0_13 (64-bit) French 3.36s 1.88s 1.16s 206%
        1.6.0_13 (64-bit) German 3.40s 1.95s 1.14s 179%
        1.6.0_13 (64-bit) Ukrainian 4.33s 2.65s 1.21s 117%

        And here is a comparison of the two:

        Sun JVM Language java.text improvement ICU4J improvement
        1.5.0_15 (32-bit) English 5.1% 16.8%
        1.5.0_15 (32-bit) French 3.8% 12.9%
        1.5.0_15 (32-bit) German 3.9% 13.1%
        1.5.0_15 (32-bit) Ukrainian 2.6% 6.2%
        Sun JVM Language java.text improvement ICU4J improvement
        1.5.0_15 (64-bit) English 6.6% -2.2%
        1.5.0_15 (64-bit) French 4.4% 7.7%
        1.5.0_15 (64-bit) German 5.0% -2.0%
        1.5.0_15 (64-bit) Ukrainian 3.3% -3.7%
        Sun JVM Language java.text improvement ICU4J improvement
        1.6.0_13 (64-bit) English 0.5% 36.1%
        1.6.0_13 (64-bit) French 11.4% 25.5%
        1.6.0_13 (64-bit) German -1.7% 33.8%
        1.6.0_13 (64-bit) Ukrainian 5.3% 20.6%

        It's not unequivocal, but there is a definite overall improvement in the patched version; I'd say these results justify applying the patch.

        I won't post them here (mostly because I didn't save them), but I've run the same benchmark (with some variation in the number of iterations) and noticed that while there are always a couple of places where the unpatched version appears to do slightly better, the place at which this occurs is not consistent, and the cases where the patched version improves throughput always dominate.

        Steve Rowe added a comment -

        TopTFWikipediaWords.tar.bz2 contains a Maven2 project that parses unpacked Wikipedia dump files, creates a Lucene index from the tokens produced by the contrib WikipediaTokenizer, iterates over the indexed tokens' term docs to accumulate term frequencies, stores the results in a bounded priority queue, and then outputs contrib benchmark LineDoc format, with the title field containing the collection term frequency, the date field containing the date the file was generated, and the body field containing the term text.

        This code knows how to handle English, German, French, and Ukrainian, but could be extended for other languages.

        I used this project to generate the line-docs for the 4 languages' 100k most frequent terms, in the collation benchmark archive attachment on this issue.

        Steve Rowe added a comment -

        Fixed up version of collation.benchmark.tar.bz2 that removes printing of progress from the collation/run-benchmark.sh script - otherwise the same as the previous version.

        Steve Rowe added a comment -

        Do you think after this issue is resolved (whether it helps or doesn't help/won't fix either way) that we should open a separate issue to work on committing the benchmark so we have collation benchmarks for the future? Seems like it would be good to have in the future.

        Yeah, I don't know quite how to do that - the custom code wrapping ICU/CollationKeyAnalyzer is necessary because the contrib benchmark alg format can only handle zero-argument analyzer constructors (or those that take Version arguments). I think it would be useful to have a general-purpose alg syntax to handle this case without requiring custom code, and also, more generally, to allow for the construction of arbitrary analysis pipelines without requiring custom code (a la Solr schema). The alg parsing code is a bit dense though - I think it could be converted to a JFlex-generated grammar to simplify this kind of syntax extension.

        Can you think of an alternate way to package this benchmark that fits with current practice?
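        For illustration, a hypothetical zero-argument wrapper of the kind described above might look like the following; the field name, base analyzers, and class name are assumptions, and the actual analyzers in collation.benchmark.tar.bz2 may differ (this uses the pre-4.0 Analyzer/PerFieldAnalyzerWrapper API current at the time of this issue).

        import java.io.Reader;
        import java.text.Collator;
        import java.util.Locale;
        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.WhitespaceAnalyzer;
        import org.apache.lucene.collation.CollationKeyAnalyzer;

        // Zero-argument analyzer callable from a benchmark .alg file: it runs the
        // CollationKeyAnalyzer only over the line-doc "body" field.
        public class FrenchCollationBenchmarkAnalyzer extends Analyzer {
          private final PerFieldAnalyzerWrapper delegate;

          public FrenchCollationBenchmarkAnalyzer() {
            PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new WhitespaceAnalyzer());
            wrapper.addAnalyzer("body", new CollationKeyAnalyzer(Collator.getInstance(Locale.FRENCH)));
            this.delegate = wrapper;
          }

          @Override
          public TokenStream tokenStream(String fieldName, Reader reader) {
            return delegate.tokenStream(fieldName, reader);
          }
        }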

        Robert Muir added a comment -

        I think it would be useful to have a general-purpose alg syntax to handle this case without requiring custom code, and also, more generally, to allow for the construction of arbitrary analysis pipelines without requiring custom code (a la Solr schema). The alg parsing code is a bit dense though - I think it could be converted to a JFlex-generated grammar to simplify this kind of syntax extension.

        That would be a great improvement to the benchmark package!

        Alternatively, in the short term, I wonder if we could somehow abuse a system property for the locale, setting it from the alg file to change the locale?

        Uwe Schindler added a comment -

        In my opinion, in the committed patch and also before it, there is an error in handling the ByteBuffer/CharBuffer arrayOffset() value. It calculated the length as limit()-arrayOffset(), which is wrong. The length available in the buffer is simply remaining() - this is how all other encoders/decoders in the JDK work (see the src). array() and arrayOffset() simply give access to the wrapped array and its offset, but say nothing about the length. The starting position to decode must also be adjusted and is arrayOffset()+position(). The code fails if you wrap an array with offset != 0 and some length, or if you pass a buffer whose position is not 0.

        For correct usage of NIO Buffers, see my latest improvements in the payload.IdentityEncoder (issue LUCENE-2157). Any code getting a buffer for encoding/decoding should assume that it should read from the buffer starting at position(), reading remaining() items.
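        To make the arithmetic concrete, here is a small standalone demonstration of the difference; the array size, offset, and length are arbitrary example values, not values from the Lucene code.

        import java.nio.ByteBuffer;

        public class BufferRegionDemo {
          public static void main(String[] args) {
            ByteBuffer buf = ByteBuffer.wrap(new byte[10], 2, 5); // position=2, limit=7, arrayOffset=0

            // Standard NIO reading of the valid region:
            int start = buf.arrayOffset() + buf.position(); // 2
            int length = buf.remaining();                   // 5

            // The formula used by the Byte/CharBuffer-based methods:
            int oldLength = buf.limit() - buf.arrayOffset(); // 7: too long, and decoding would start at index 0

            System.out.println("start=" + start + " length=" + length + " oldLength=" + oldLength);
          }
        }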

        Robert Muir added a comment -

        Uwe, another thing is that currently the IndexableBinaryStringTools code requires that the buffer have a backing array, which is optional.

        One idea: if we implement the code with byte[]/char[] as this patch does, perhaps we can change the buffer code (which is just helper methods around this) to remove that restriction.

        After all, we now have a benchmark, so we can measure this and make sure it does not hurt performance for the CharBuffer/ByteBuffer case, although with this patch those buffer methods are not used by any collation filters.

        Robert Muir added a comment -

        Steve, I ran the benchmark; it worked like a charm. Here are my numbers.
        Some differences from your setup: a very slow computer, newer JVM versions, etc.

        Unpatched:

        Sun JVM Language java.text ICU4J KeywordAnalyzer ICU4J Improvement
        1.5.0_22 (32-bit) English 31.61s 21.34s 10.66s 96%
        1.5.0_22 (32-bit) French 34.11s 22.59s 11.12s 100%
        1.5.0_22 (32-bit) German 34.59s 22.69s 10.84s 100%
        1.5.0_22 (32-bit) Ukrainian 50.72s 30.19s 11.09s 107%
        Sun JVM Language java.text ICU4J KeywordAnalyzer ICU4J Improvement
        1.6.0_16 (32-bit) English 26.42s 19.98s 10.11s 65%
        1.6.0_16 (32-bit) French 28.48s 20.14s 10.30s 85%
        1.6.0_16 (32-bit) German 28.41s 21.38s 9.78s 61%
        1.6.0_16 (32-bit) Ukrainian 40.55s 28.56s 9.70s 64%

        Patched:

        Sun JVM Language java.text ICU4J KeywordAnalyzer ICU4J Improvement
        1.5.0_22 (32-bit) English 29.59s 19.33s 10.56s 117%
        1.5.0_22 (32-bit) French 32.38s 21.02s 10.88s 112%
        1.5.0_22 (32-bit) German 32.92s 20.58s 10.66s 124%
        1.5.0_22 (32-bit) Ukrainian 49.59s 27.05s 10.77s 138%
        Sun JVM Language java.text ICU4J KeywordAnalyzer ICU4J Improvement
        1.6.0_16 (32-bit) English 24.91s 18.28s 9.72s 77%
        1.6.0_16 (32-bit) French 26.30s 19.12s 10.25s 81%
        1.6.0_16 (32-bit) German 26.70s 19.23s 10.48s 85%
        1.6.0_16 (32-bit) Ukrainian 38.48s 25.66s 9.80s 81%

        Comparison:

        Sun JVM Language java.text improvement ICU4J improvement
        1.5.0_22 (32-bit) English 10.5% 22.2%
        1.5.0_22 (32-bit) French 7.4% 13.6%
        1.5.0_22 (32-bit) German 7.1% 19.9%
        1.5.0_22 (32-bit) Ukrainian 2.5% 17.8%
        Sun JVM Language java.text improvement ICU4J improvement
        1.6.0_16 (32-bit) English 7.8% 15.8%
        1.6.0_16 (32-bit) French 13.7% 11.4%
        1.6.0_16 (32-bit) German 15.3% 33.0%
        1.6.0_16 (32-bit) Ukrainian 8.0% 19.4%
        Robert Muir added a comment -

        In my opinion in the committed patch and also before there is an error in handling the ByteBuffer/CharBuffer arrayOffset() value. It calculated the length using limit()-arrayOffset() which is wrong

        Uwe, I think you might be wrong here.
        In this class, the javadocs state the buffer must have a backing array.
        Therefore, limit - offset is a correct way to calculate the length of the entire buffer, which is what this class has always used.

        The length available in the buffer is simple remaining() - this is how all other encoders/decoders in the JDK work (see src).

        This is not correct; this is the remaining relative length, as remaining() is based upon position: it returns the number of elements between the current position and the limit.
        Unlike typical JDK code, this code has always ignored the relative position and instead encodes/decodes entire buffers.

        We should not change this under this issue... it is unrelated. I do not think we should change the semantics of this encode/decode to take any relative position into account.

        Robert Muir added a comment -

        Updated patch:

        • deprecate the now unused CharBuffer/ByteBuffer methods
        • add tests for the char[]/byte[] based methods (keep the old NIO ones, but deprecate them)
        • change getEncodedLength to take byte[],int,int, consistent with getDecodedLength; even though it only needs to know the number of bytes, if we want to modify this encoding in the future we will not need to break the API
        • javadoc fixes
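        For orientation, here is a sketch of how a collation filter can use the array-based methods without any Buffer wrapping; the collator, input string, and exact parameter order of encode are assumptions based on the discussion in this issue rather than quotes from the patch.

        import java.text.Collator;
        import java.util.Locale;
        import org.apache.lucene.util.IndexableBinaryStringTools;

        public class CollationKeyEncodingSketch {
          public static void main(String[] args) {
            Collator collator = Collator.getInstance(Locale.FRENCH);
            byte[] keyBytes = collator.getCollationKey("abc").toByteArray();

            // Array-based encoding: no ByteBuffer/CharBuffer intermediaries.
            int encodedLength = IndexableBinaryStringTools.getEncodedLength(keyBytes, 0, keyBytes.length);
            char[] encoded = new char[encodedLength];
            IndexableBinaryStringTools.encode(keyBytes, 0, keyBytes.length, encoded, 0, encodedLength);

            // The encoded chars can be copied straight into the term attribute's buffer.
            System.out.println(new String(encoded, 0, encodedLength));
          }
        }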
        Uwe Schindler added a comment - - edited

        Therefore, limit - offset is a correct way to calculate length of the entire buffer, which is what this class has always used.

        This is exactly wrong: the length of the "valid" area inside the buffer is always limit()-position() [see the javadocs of wrap()]. If arrayOffset()!=0, position() may still be 0, because the array offset is only used when wrapping byte arrays. The arrayOffset() gives the position in the underlying array that corresponds to the buffer's position()==0 (see javadocs).

        This error is mostly not a problem for dynamically allocated buffers (which most users use), because arrayOffset() is always 0 there (I have never seen a buffer with offset != 0).

        A problem also occurs if you wrap an array with wrap([], offset, length): the offset there is not the arrayOffset(), it is the position()! And after wrapping, the length is remaining(). The capacity() is the array length. Therefore it is really a bug. You deprecated it, that's good, but it should be fixed; it does not work correctly.

        To verify, read the source code of ByteBuffer and HeapByteBuffer from src.zip in your JDK distribution, and of course the javadocs.

        So just to repeat, the correct code would be:

        • start position in array = arrayOffset() + position()
        • length = remaining()

        This is how the IdentityEncoder for payloads works. So deprecate and fix it.
        If we really want to work with buffers, the correct way to implement this would be for the whole class to implement CharsetEncoder.

        Uwe Schindler added a comment - - edited

        I found one place where the arrayOffset() bug occurs:

        • Create a ByteBuffer by wrapping a byte[] with an offset and length.
        • Use ByteBuffer.slice() to create an independent buffer whose position()==0 corresponds to the current position of the previous ByteBuffer (the one wrapping the byte[]). The new buffer will have an arrayOffset() equal to the original offset (which is position 0 in this buffer). The limit/capacity of this buffer is the above length, which is remaining() on the original buffer.

        The length formula is definitely wrong; it would sometimes underflow and give negative numbers, e.g. if the position of the original buffer is 5 (so arrayOffset()==5 in the slice) and the remaining bytes are 3, then the length computed by the code before/after the patch is 3 - 5 = -2.

        I will provide a testcase later that shows both bugs (position() and arrayOffset() misuse).
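        A compact reproduction of that scenario (the array size 10, offset 5, and length 3 are arbitrary example values):

        import java.nio.ByteBuffer;

        public class SliceOffsetDemo {
          public static void main(String[] args) {
            ByteBuffer original = ByteBuffer.wrap(new byte[10], 5, 3); // position=5, limit=8
            ByteBuffer slice = original.slice();                       // position=0, limit=3, arrayOffset=5

            int oldFormula = slice.limit() - slice.arrayOffset(); // 3 - 5 = -2: a negative "length"
            int correct = slice.remaining();                      // 3

            System.out.println("old formula=" + oldFormula + " remaining=" + correct);
          }
        }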

        Robert Muir added a comment -

        Uwe, can we please do any NIO bugfixes under another issue?

        It's not related to this one. This issue is about not using NIO, since it only makes collation slower.

        Robert Muir added a comment -

        In this patch, I preserve the javadoc at the beginning of this class, which describes in detail its previous behavior: exactly how the length is calculated from arrayOffset() and limit().

        The only difference is that I change the text from
        "This class's operations are defined over CharBuffers and ByteBuffers ..."
        to
        "Some methods in this class are defined over CharBuffers and ByteBuffers ...".

        IMHO, this existing NIO behavior is documented and now deprecated; it should not be changed, and definitely not changed in this issue.

        Robert Muir added a comment -

        Hi, this is a simple, straightforward performance improvement, between 10% and 20% on average it looks, with no backwards-compatibility problems.

        What Uwe describes as a problem is not introduced here, but exists already and is documented as such in the class javadocs, and preserved for backwards compatibility.

        So, I think such backwards-breakage should be under a separate issue.

        I will commit this one in a few days if no one objects.

        Uwe Schindler added a comment - - edited

        +1

        And please add a note that the Char/ByteBuffer methods are completely broken. I have two tests here that broke with a negative array size and so on. The person who wrote the original CharBuffer code misunderstood the admittedly complicated logic behind ByteBuffers.

        So the only requirement from my side is to write in the deprecated message or the docs on top:

        instead:

         * Note that this class calls array() and arrayOffset()
         * on the CharBuffers and ByteBuffers it uses, so only wrapped arrays may be
         * used.  This class interprets the arrayOffset() and limit() values returned by
         * its input buffers as beginning and end+1 positions on the wrapped array,
         * respectively; similarly, on the output buffer, arrayOffset() is the first
         * position written to, and limit() is set to one past the final output array
         * position.
        

        please:

         * The methods operating on CharBuffers and ByteBuffers are
         * deprecated in favour of methods that operate directly on byte[] and char[]
         * arrays. Note that this class calls array() and arrayOffset()
         * on the CharBuffers and ByteBuffers it uses, so only wrapped arrays may be
         * used. Furthermore, the deprecated methods only work correctly
         * if you simply use Buffer.wrap on an array without a non-zero offset,
         * and they do not work on buffers returned by the slice() method.
         * Buffers wrapped with an offset completely ignore the offset and transform
         * limit() bytes. Buffer slices are completely broken and sometimes compute
         * a negative buffer length.
        

        Maybe in better English.

        Robert Muir added a comment -

        Thanks Uwe, I will incorporate some of this wording into the javadocs and upload a new patch.

        Later, if we want, we can discuss changing how the NIO methods work; perhaps it is best to break back compat, but I am just trying to avoid doing that within the same issue as a performance improvement that is trying to preserve all back compat.

        Uwe Schindler added a comment -

        OK.

        I do not want to discuss this further; the correct way for both directions, according to NIO, would be:

        final int inputOffset = input.arrayOffset() + input.position();   // first valid index in the backing array
        final int inputLength = input.remaining();                        // number of valid elements
        final int outputOffset = output.arrayOffset() + output.position();
        final int outputLength = getEncodedLength(input.array(), inputOffset, inputLength);
        output.limit(outputLength + output.position());                   // mark how much will be written
        //bogus: output.position(0);
        

        The same applies the other way round (with getDecodedLength()).

        If we want to fix this, let's put it into a new issue. But as the methods are deprecated, with the big warning we do not need to fix it.

        Robert Muir added a comment -

        Updated patch with Uwe's javadoc suggestions.

        Steve Rowe added a comment -

        Hi Robert, I took a look at the patch and found a couple of minor issues:

        1. The newly deprecated methods should get @Deprecated annotations (in addition to the @deprecated javadoc tags)
        2. IntelliJ tells me that the "final" modifier on some of the public static methods is not cool - AFAICT, although static implies final, it may be useful to leave this anyway, since unlike the static modifier, the final modifier disallows hiding of the method by subclasses? I dunno. (Checking Lucene source, there are many "static final" methods, so maybe I should tell IntelliJ it's not a problem.)
        3. Unlike getEncodedLength(byte[],int,int), getDecodedLength(char[],int,int) doesn't protect against overflow in the int multiplication by casting to long.
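        As a sketch of the guard item 3 asks for: the 15-bits-per-char figure matches the IndexableBinaryStringTools encoding, but the constant name and the rounding below are illustrative only, not the class's actual formula.

        public class DecodedLengthSketch {
          // IndexableBinaryStringTools packs 15 bits of binary data into each encoded char.
          private static final int BITS_PER_ENCODED_CHAR = 15;

          // Casting to long before the multiplication avoids int overflow for huge inputs.
          static int decodedLengthInBytes(int numChars) {
            return (int) (((long) numChars * BITS_PER_ENCODED_CHAR) / 8); // illustrative rounding
          }

          public static void main(String[] args) {
            // 200,000,000 * 15 overflows int; the long math stays correct.
            System.out.println(decodedLengthInBytes(200000000));
          }
        }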
        Robert Muir added a comment -

        Steve, thanks for pointing these out.

        #3 concerns me somewhat; this is an existing problem in trunk (I guess only for enormous terms, though). Should we consider backporting a fix?

        I will upload a new patch; it takes a couple of tries for one this large on my internet connection...

        Robert Muir added a comment -

        Updated patch, with Steven's additions.

        Robert Muir added a comment -

        For the record, I think in the future we should decide on some consistency with the @Deprecated annotation in Lucene.

        Personally, I think it is useless.

        Uwe Schindler added a comment -

        It is somewhat useless, because you always have to add @deprecated to the javadocs with an explanation anyway. If we want to add it, we can use Eclipse to do it automatically with the cleanup function (like I added @Override).

        Robert Muir added a comment -

        Uwe, I think if we are going to add it here, then we should add it everywhere separately in a later issue.

        Otherwise, we should not add it here, and should remove it anywhere it is found.

        It's 100% definitely useless if we do not consistently use it for all deprecations.

        Uwe Schindler added a comment -

        It is now also used in new commits (not only in this issue). And flex uses it very often.

        Steve Rowe added a comment -

        3. Unlike getEncodedLength(byte[],int,int), getDecodedLength(char[],int,int) doesn't protect against overflow in the int multiplication by casting to long.

        #3 concerns me somewhat, this is an existing problem in trunk (i guess only for enormous terms, though). Should we consider backporting a fix?

        The current form of this calculation will correctly handle original binary content of lengths up to 136MB. IMHO the likelihood of encoding terms this enormous with IndexableBinaryStringTools is so minuscule that it's not worth the effort to backport.

        Robert Muir added a comment -

        Committed revision 895341.

        Thanks, Steven and Uwe.


          People

          • Assignee:
            Robert Muir
            Reporter:
            Robert Muir
          • Votes:
            0
            Watchers:
            2