Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: performance, util
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      We've seen a reducer writing 200MB to HDFS with replication = 1 spending a long time in crc calculation. In particular, it was spending 5 seconds in crc calculation out of a total of 6 for the write. I suspect that it is the java-jni border that is causing us grief.

      1. benchmarks20090714.txt
        9 kB
        Tsz Wo Nicholas Sze
      2. benchmarks20090715.txt
        12 kB
        Tsz Wo Nicholas Sze
      3. crc32-results.txt
        3 kB
        Todd Lipcon
      4. hadoop-5598.txt
        7 kB
        Todd Lipcon
      5. hadoop-5598.txt
        7 kB
        Todd Lipcon
      6. hadoop-5598-evil.txt
        8 kB
        Todd Lipcon
      7. hadoop-5598-hybrid.txt
        11 kB
        Todd Lipcon
      8. hadoop-6148.txt
        25 kB
        Todd Lipcon
      9. hadoop-6148.txt
        25 kB
        Todd Lipcon
      10. hadoop-6148.txt
        25 kB
        Todd Lipcon
      11. hdfs-297.txt
        19 kB
        Todd Lipcon
      12. PureJavaCrc32.java
        18 kB
        Tsz Wo Nicholas Sze
      13. PureJavaCrc32.java
        17 kB
        Scott Carey
      14. PureJavaCrc32.java
        2 kB
        Scott Carey
      15. PureJavaCrc32.java
        2 kB
        Scott Carey
      16. PureJavaCrc32New.java
        16 kB
        Scott Carey
      17. PureJavaCrc32NewInner.java
        16 kB
        Scott Carey
      18. PureJavaCrc32NewLoop.java
        16 kB
        Scott Carey
      19. TestCrc32Performance.java
        3 kB
        Scott Carey
      20. TestCrc32Performance.java
        2 kB
        Scott Carey
      21. TestCrc32Performance.java
        2 kB
        Todd Lipcon
      22. TestCrc32Performance.java
        1 kB
        Todd Lipcon
      23. TestPureJavaCrc32.java
        3 kB
        Scott Carey

        Issue Links

          Activity

          Hide
          Tsz Wo Nicholas Sze added a comment -

          There is a CRC class in org.apache.hadoop.io.compress.bzip2. Looks like that it supports computing CRC32.

          Show
          Tsz Wo Nicholas Sze added a comment - There is a CRC class in org.apache.hadoop.io.compress.bzip2. Looks like that it supports computing CRC32.
          Hide
          Ben Maurer added a comment -

          Is it possible that a fix for HADOOP-5318 would help this issue – the java/jni border should only slow things down if very small amounts of data are being CRCd.

          Show
          Ben Maurer added a comment - Is it possible that a fix for HADOOP-5318 would help this issue – the java/jni border should only slow things down if very small amounts of data are being CRCd.
          Hide
          Todd Lipcon added a comment -

          This is a patch to implement CRC32 in Pure Java, along with a performance test that shows its improvement. Also attaching the benchmark output from both Sun 1.6.0_12 and OpenJDK 1.6.0_0-b12, which looks pretty different.

          The summary is that, on Sun's JDK (which most people use), the pure Java implementation is faster for all chunk sizes less than 32 bytes (by a high factor for the smaller end of the spectrum) and about 33% slower for chunk sizes larger than that. On OpenJDK, the CRC32 implementation is 3-4x faster than the Sun JDK.

          Running the concurrency benchmark from HADOOP-5318 also shows huge improvements (the same as was seen with Ben's buffering patch) by using the pure Java CRC32. This patch contains the change to FSDataOutputStream to make use of it.

          Review from someone who understands Java's bit extension semantics better than me would be appreciated - I bet more performance can be squeezed out of this by a Java bitwise op master.

          Show
          Todd Lipcon added a comment - This is a patch to implement CRC32 in Pure Java, along with a performance test that shows its improvement. Also attaching the benchmark output from both Sun 1.6.0_12 and OpenJDK 1.6.0_0-b12, which looks pretty different. The summary is that, on Sun's JDK (which most people use), the pure Java implementation is faster for all chunk sizes less than 32 bytes (by a high factor for the smaller end of the spectrum) and about 33% slower for chunk sizes larger than that. On OpenJDK, the CRC32 implementation is 3-4x faster than the Sun JDK. Running the concurrency benchmark from HADOOP-5318 also shows huge improvements (the same as was seen with Ben's buffering patch) by using the pure Java CRC32. This patch contains the change to FSDataOutputStream to make use of it. Review from someone who understands Java's bit extension semantics better than me would be appreciated - I bet more performance can be squeezed out of this by a Java bitwise op master.
          Hide
          Todd Lipcon added a comment -

          Attached is a new version that is a hybrid implementation. For writes smaller than a threshold it calculates CRC32 natively. Above the threshold, it uses the java.util.zip implementation, which it folds back in lazily using crc32_combine ported from zlib.

          On the old TestCrc32Performance benchmark, this version was always faster or as fast as "theirs". I added a new benchmark test which sizes the writes randomly, on which the hybrid version is awful in certain cases since it spends most of its time in crc32_combine. For this hybrid model to work, there will need to be some kind of hysteresis when switching between implementations, so as to avoid crc32_combine.

          If someone has Java 1.6 update 14 handy, I'd be interested to see if the new array bounds checking elimination optimization makes the pure Java fast enough to completely replace java.util.zip's.

          Show
          Todd Lipcon added a comment - Attached is a new version that is a hybrid implementation. For writes smaller than a threshold it calculates CRC32 natively. Above the threshold, it uses the java.util.zip implementation, which it folds back in lazily using crc32_combine ported from zlib. On the old TestCrc32Performance benchmark, this version was always faster or as fast as "theirs". I added a new benchmark test which sizes the writes randomly, on which the hybrid version is awful in certain cases since it spends most of its time in crc32_combine. For this hybrid model to work, there will need to be some kind of hysteresis when switching between implementations, so as to avoid crc32_combine. If someone has Java 1.6 update 14 handy, I'd be interested to see if the new array bounds checking elimination optimization makes the pure Java fast enough to completely replace java.util.zip's.
          Hide
          dhruba borthakur added a comment -

          Looking at the results, it appears that a choice of Crc algorithms, crc data sizes and JVMs can result in pretty varied performance numbers. Maybe it would be worthwhile to make the ChecksymFilesystem pick a Checksum object based on a config parameter.

          Also, the hybrid model of dynamically deciding which algo to use (based on the size of data to checksum) sounds a litle scary to me

          Show
          dhruba borthakur added a comment - Looking at the results, it appears that a choice of Crc algorithms, crc data sizes and JVMs can result in pretty varied performance numbers. Maybe it would be worthwhile to make the ChecksymFilesystem pick a Checksum object based on a config parameter. Also, the hybrid model of dynamically deciding which algo to use (based on the size of data to checksum) sounds a litle scary to me
          Hide
          Todd Lipcon added a comment -

          The other solution that would solve this whole issue is to simply not checksum the data in FSOutputSummer until an entire chunk has been buffered up. This would bring the "write size" up to io.bytes.per.checksum, 512 by default, which performs better in java.util.zip.CRC32 than in the Java implementation. Doug expressed some concern over that idea in HADOOP-5318, I guess since data can sit in memory for an arbitrarily long amount of time before getting checksummed, but I would imagine most people are using ECC RAM anyway

          Show
          Todd Lipcon added a comment - The other solution that would solve this whole issue is to simply not checksum the data in FSOutputSummer until an entire chunk has been buffered up. This would bring the "write size" up to io.bytes.per.checksum, 512 by default, which performs better in java.util.zip.CRC32 than in the Java implementation. Doug expressed some concern over that idea in HADOOP-5318 , I guess since data can sit in memory for an arbitrarily long amount of time before getting checksummed, but I would imagine most people are using ECC RAM anyway
          Hide
          Todd Lipcon added a comment -

          Here's another approach, which is exceedingly evil, but also "best of both worlds" in terms of speed. It switches between pure-Java CRC32 and the built-in CRC32, but avoids the slow crc_combine function. It does so by cheating using reflection to set the internal (private) crc member of the java.util.zip.CRC32.

          This is probably not a good idea to actually use, but it does show an "upper bound" in terms of performance. In benchmarks, this "fast" version is always as fast as the faster of the pure and built-in CRC32 routines. In case the reflection fails, it falls back to always pure.

          Owen and Arun: could you guys comment on the real-life workload where you saw this as an issue? I would imagine most MR workloads don't have this problem since they're spilling out of an in-memory buffer and therefore can write large chunks.

          Show
          Todd Lipcon added a comment - Here's another approach, which is exceedingly evil, but also "best of both worlds" in terms of speed. It switches between pure-Java CRC32 and the built-in CRC32, but avoids the slow crc_combine function. It does so by cheating using reflection to set the internal (private) crc member of the java.util.zip.CRC32. This is probably not a good idea to actually use, but it does show an "upper bound" in terms of performance. In benchmarks, this "fast" version is always as fast as the faster of the pure and built-in CRC32 routines. In case the reflection fails, it falls back to always pure. Owen and Arun: could you guys comment on the real-life workload where you saw this as an issue? I would imagine most MR workloads don't have this problem since they're spilling out of an in-memory buffer and therefore can write large chunks.
          Hide
          Todd Lipcon added a comment -

          I managed to coerce Java into cooperating with int sized variables, and it's significantly faster now. It's now about 5-10% slower than the built-in CRC for large sizes.

          Show
          Todd Lipcon added a comment - I managed to coerce Java into cooperating with int sized variables, and it's significantly faster now. It's now about 5-10% slower than the built-in CRC for large sizes.
          Hide
          Scott Carey added a comment -

          This version of PureJavaCrc32.java significantly beats, or equals, the native implementation. For small chunks, it is far faster. For large chunks it is equal.

          This version is simply a term expansion of the previous one, and it operates four bytes at a time.

          There is a way to nearly double this one more time for large chunks, but it is proving tricky to nail down correctly.

          Show
          Scott Carey added a comment - This version of PureJavaCrc32.java significantly beats, or equals, the native implementation. For small chunks, it is far faster. For large chunks it is equal. This version is simply a term expansion of the previous one, and it operates four bytes at a time. There is a way to nearly double this one more time for large chunks, but it is proving tricky to nail down correctly.
          Hide
          Todd Lipcon added a comment -

          Scott: I just tried your version and was unable to get the same performance improvements. I think we've established that the pure Java definitely wins on small blocks. For large blocks, I'm seeing the following on my laptop (64-bit, with 64-bit JRE):

          My most recent non-evil pure-Java: 250M/sec
          Scott's patch that unrolls the loop: 260-280M/sec
          Sun Java 1.6 update 14: 333M/sec
          OpenJDK 1.6: 795M/sec

          The OpenJDK implementation is simply wrapping zlib's crc32 routine, which must be highly optimized. Given that we already have a JNI library for native compression using zlib, I'd like to simply add a stub to libhadoop that wraps zlib's crc32. That should give us the same ~800M/sec throughput for large blocks. Since we can implement the stub ourself, we also have the ability to switch to pure Java for small sizes and get the 20x speedup with no adversarial workloads that cause bad performance. On systems where the native code isn't available, we can simply use the pure Java for all sizes, since at worst it's only slightly slower than java.util.Crc32 and at best it's 30x faster.

          I imagine that most production systems are using libhadoop, or at least could easily get this deployed if it was shown to have significant performance benefits.

          I'll upload a patch later this evening for this.

          Show
          Todd Lipcon added a comment - Scott: I just tried your version and was unable to get the same performance improvements. I think we've established that the pure Java definitely wins on small blocks. For large blocks, I'm seeing the following on my laptop (64-bit, with 64-bit JRE): My most recent non-evil pure-Java: 250M/sec Scott's patch that unrolls the loop: 260-280M/sec Sun Java 1.6 update 14: 333M/sec OpenJDK 1.6: 795M/sec The OpenJDK implementation is simply wrapping zlib's crc32 routine, which must be highly optimized. Given that we already have a JNI library for native compression using zlib, I'd like to simply add a stub to libhadoop that wraps zlib's crc32. That should give us the same ~800M/sec throughput for large blocks. Since we can implement the stub ourself, we also have the ability to switch to pure Java for small sizes and get the 20x speedup with no adversarial workloads that cause bad performance. On systems where the native code isn't available, we can simply use the pure Java for all sizes, since at worst it's only slightly slower than java.util.Crc32 and at best it's 30x faster. I imagine that most production systems are using libhadoop, or at least could easily get this deployed if it was shown to have significant performance benefits. I'll upload a patch later this evening for this.
          Hide
          Scott Carey added a comment -

          This version does some algebraic manipulation and gets to about 10% faster than the native implementation on large blocks on my machine (Java 1.6, mac osx, 64 bit, 2.5Ghz Core2 duo).

          pure java 16MB block: 397.516 MB/sec
          sun native 16MB block: 337.731 MB/sec

          This version uses the same lookup table as the previous, occupying 1KB.

          I have another pure java version that uses four lookup tables (4KB) that I will be posting shortly after I clean it up.

          Its results for large blocks are:

          pure java 16MB block: 624.390 MB/sec
          sun native 16MB block: 342.246 MB/sec

          it first breaks 600MB/sec at a block size of 128 bytes and is over 520MB/sec at a block size of 32 bytes.

          A big remaining question is performance under concurrency. The larger lookup table footprint may bring this version down a little.
          Any version calling out to native code may also slow under concurrency.

          Show
          Scott Carey added a comment - This version does some algebraic manipulation and gets to about 10% faster than the native implementation on large blocks on my machine (Java 1.6, mac osx, 64 bit, 2.5Ghz Core2 duo). pure java 16MB block: 397.516 MB/sec sun native 16MB block: 337.731 MB/sec This version uses the same lookup table as the previous, occupying 1KB. I have another pure java version that uses four lookup tables (4KB) that I will be posting shortly after I clean it up. Its results for large blocks are: pure java 16MB block: 624.390 MB/sec sun native 16MB block: 342.246 MB/sec it first breaks 600MB/sec at a block size of 128 bytes and is over 520MB/sec at a block size of 32 bytes. A big remaining question is performance under concurrency. The larger lookup table footprint may bring this version down a little. Any version calling out to native code may also slow under concurrency.
          Hide
          Scott Carey added a comment -

          This version of PureJavaCrc32 is between 10x and 1.8x faster than Sun's native implementation depending on the chunk size.

          Results on my latop (mac osx with Java 1.6 64 bit, -Xmx512m, 2.5Ghz processor) below. Run to run, results vary by about 5%.

          CRC32Class num bytes throughput
          PureJava 1 99.533 MB/sec
          SunNative 1 9.772 MB/sec
          PureJava 2 163.265 MB/sec
          SunNative 2 18.846 MB/sec
          PureJava 4 234.004 MB/sec
          SunNative 4 37.307 MB/sec
          PureJava 8 307.692 MB/sec
          SunNative 8 66.876 MB/sec
          PureJava 16 432.432 MB/sec
          SunNative 16 110.919 MB/sec
          PureJava 32 522.449 MB/sec
          SunNative 32 161.616 MB/sec
          PureJava 64 547.009 MB/sec
          SunNative 64 217.687 MB/sec
          PureJava 128 432.432 MB/sec
          SunNative 128 270.042 MB/sec
          PureJava 256 551.724 MB/sec
          SunNative 256 299.065 MB/sec
          PureJava 512 615.385 MB/sec
          SunNative 512 321.608 MB/sec
          PureJava 1024 551.724 MB/sec
          SunNative 1024 212.625 MB/sec
          PureJava 2048 561.404 MB/sec
          SunNative 2048 309.179 MB/sec
          PureJava 4096 551.724 MB/sec
          SunNative 4096 307.692 MB/sec
          PureJava 8192 589.862 MB/sec
          SunNative 8192 316.049 MB/sec
          PureJava 16384 640.000 MB/sec
          SunNative 16384 343.164 MB/sec
          PureJava 32768 643.216 MB/sec
          SunNative 32768 343.164 MB/sec
          PureJava 65536 621.359 MB/sec
          SunNative 65536 345.013 MB/sec
          PureJava 131072 636.816 MB/sec
          SunNative 131072 345.946 MB/sec
          PureJava 262144 636.816 MB/sec
          SunNative 262144 343.164 MB/sec
          PureJava 524288 646.465 MB/sec
          SunNative 524288 345.946 MB/sec
          PureJava 1048576 640.000 MB/sec
          SunNative 1048576 343.164 MB/sec
          PureJava 2097152 633.663 MB/sec
          SunNative 2097152 347.826 MB/sec
          PureJava 4194304 636.816 MB/sec
          SunNative 4194304 291.572 MB/sec
          PureJava 8388608 618.357 MB/sec
          SunNative 8388608 342.246 MB/sec
          PureJava 16777216 624.390 MB/sec
          SunNative 16777216 307.692 MB/sec
          Show
          Scott Carey added a comment - This version of PureJavaCrc32 is between 10x and 1.8x faster than Sun's native implementation depending on the chunk size. Results on my latop (mac osx with Java 1.6 64 bit, -Xmx512m, 2.5Ghz processor) below. Run to run, results vary by about 5%. CRC32Class num bytes throughput PureJava 1 99.533 MB/sec SunNative 1 9.772 MB/sec PureJava 2 163.265 MB/sec SunNative 2 18.846 MB/sec PureJava 4 234.004 MB/sec SunNative 4 37.307 MB/sec PureJava 8 307.692 MB/sec SunNative 8 66.876 MB/sec PureJava 16 432.432 MB/sec SunNative 16 110.919 MB/sec PureJava 32 522.449 MB/sec SunNative 32 161.616 MB/sec PureJava 64 547.009 MB/sec SunNative 64 217.687 MB/sec PureJava 128 432.432 MB/sec SunNative 128 270.042 MB/sec PureJava 256 551.724 MB/sec SunNative 256 299.065 MB/sec PureJava 512 615.385 MB/sec SunNative 512 321.608 MB/sec PureJava 1024 551.724 MB/sec SunNative 1024 212.625 MB/sec PureJava 2048 561.404 MB/sec SunNative 2048 309.179 MB/sec PureJava 4096 551.724 MB/sec SunNative 4096 307.692 MB/sec PureJava 8192 589.862 MB/sec SunNative 8192 316.049 MB/sec PureJava 16384 640.000 MB/sec SunNative 16384 343.164 MB/sec PureJava 32768 643.216 MB/sec SunNative 32768 343.164 MB/sec PureJava 65536 621.359 MB/sec SunNative 65536 345.013 MB/sec PureJava 131072 636.816 MB/sec SunNative 131072 345.946 MB/sec PureJava 262144 636.816 MB/sec SunNative 262144 343.164 MB/sec PureJava 524288 646.465 MB/sec SunNative 524288 345.946 MB/sec PureJava 1048576 640.000 MB/sec SunNative 1048576 343.164 MB/sec PureJava 2097152 633.663 MB/sec SunNative 2097152 347.826 MB/sec PureJava 4194304 636.816 MB/sec SunNative 4194304 291.572 MB/sec PureJava 8388608 618.357 MB/sec SunNative 8388608 342.246 MB/sec PureJava 16777216 624.390 MB/sec SunNative 16777216 307.692 MB/sec
          Hide
          Scott Carey added a comment -

          I additionally added my modifications to the test class and performance test to this JIRA.

          Show
          Scott Carey added a comment - I additionally added my modifications to the test class and performance test to this JIRA.
          Hide
          Owen O'Malley added a comment -

          Our problem with JNI mostly happens when you have large byte[] that you are using for your input. However, it depends a lot on the fragmentation of the heap and thus is not easy to benchmark against. It was in the context of doing the terabyte sort. The problem with JNI is that to get access to a byte[], the runtime may need to copy the array in/out of the C code. If the array is 100 mb, that takes a lot of time.

          Show
          Owen O'Malley added a comment - Our problem with JNI mostly happens when you have large byte[] that you are using for your input. However, it depends a lot on the fragmentation of the heap and thus is not easy to benchmark against. It was in the context of doing the terabyte sort. The problem with JNI is that to get access to a byte[], the runtime may need to copy the array in/out of the C code. If the array is 100 mb, that takes a lot of time.
          Hide
          Owen O'Malley added a comment -

          I should have commented earlier on this. I think the right solution is to use a pure Java impl if we can get the performance comparable in the "normal" case. If use a C implementation in libhadoop, it should use DirectByteBuffers and pool those buffers. Furthermore, it should be a different jira, since there are a lot more issues there.

          I'd also veto any code that dynamically switches implementations based on anything other that whether libhadoop is present. (ie. switching based on the size of the input is going to be unmaintainable)

          I can upload the code that I wrote for the pure java, if you want to see a third implementation.

          Show
          Owen O'Malley added a comment - I should have commented earlier on this. I think the right solution is to use a pure Java impl if we can get the performance comparable in the "normal" case. If use a C implementation in libhadoop, it should use DirectByteBuffers and pool those buffers. Furthermore, it should be a different jira, since there are a lot more issues there. I'd also veto any code that dynamically switches implementations based on anything other that whether libhadoop is present. (ie. switching based on the size of the input is going to be unmaintainable) I can upload the code that I wrote for the pure java, if you want to see a third implementation.
          Hide
          Todd Lipcon added a comment -

          Thanks, Owen. I definitely see your point about the JNI code possibly locking or copying. The API we're using says it tries to avoid doing that when possible, but I guess for large blocks, heap fragmentation is quite likely and we'd run into that issue.

          I'm going on vacation this next week, so I'm unlikely to work on it, but if you have a moment to upload your implementation for comparison I'll be back on this the week after next.

          Show
          Todd Lipcon added a comment - Thanks, Owen. I definitely see your point about the JNI code possibly locking or copying. The API we're using says it tries to avoid doing that when possible, but I guess for large blocks, heap fragmentation is quite likely and we'd run into that issue. I'm going on vacation this next week, so I'm unlikely to work on it, but if you have a moment to upload your implementation for comparison I'll be back on this the week after next.
          Hide
          Todd Lipcon added a comment -

          Hi Scott,

          I'm looking over your patch. One question:

              int seed = 0xEDB88320;
              for (int i = 0; i < 256; i++) {
                // T1 is the ~crc for each single byte
                // create this with the crc seed 0xdebb20e3 for crc32
          

          The seed indicated in the comments doesn't line up with the seed in the code. Can you please explain this?

          Also, I'm glad to see the code that generates the tables T1 through T4, but I'm not certain it belongs in the actual patch, given it's essentially "dead code". Perhaps it's best to just put the code in this JIRA as a separate program with a main() function for future reference. What do others think?

          Show
          Todd Lipcon added a comment - Hi Scott, I'm looking over your patch. One question: int seed = 0xEDB88320; for ( int i = 0; i < 256; i++) { // T1 is the ~crc for each single byte // create this with the crc seed 0xdebb20e3 for crc32 The seed indicated in the comments doesn't line up with the seed in the code. Can you please explain this? Also, I'm glad to see the code that generates the tables T1 through T4, but I'm not certain it belongs in the actual patch, given it's essentially "dead code". Perhaps it's best to just put the code in this JIRA as a separate program with a main() function for future reference. What do others think?
          Hide
          Todd Lipcon added a comment -

          Here is a patch with Scott's latest implementation with a few changes:

          • Style cleanups (mostly whitespace, some more comments)
          • Removed the code to generate T1 through T4 since it should never change and the JIRA is here for reference. If people -1 this I'll submit a new patch with it included.

          Although this JIRA is for the benefit of HDFS, the code itself is in o.a.h.util, so this patch applies against the Common repository. We'll need a second patch against the HDFS repo to make the one-liner change from "new CRC32" to "new PureJavaCrc32".

          Should we move this JIRA back to Common and add a HDFS JIRA for the oneliner patch? Not sure how pedantic we're supposed to be about which JIRAs and patches apply where

          Show
          Todd Lipcon added a comment - Here is a patch with Scott's latest implementation with a few changes: Style cleanups (mostly whitespace, some more comments) Removed the code to generate T1 through T4 since it should never change and the JIRA is here for reference. If people -1 this I'll submit a new patch with it included. Although this JIRA is for the benefit of HDFS, the code itself is in o.a.h.util, so this patch applies against the Common repository. We'll need a second patch against the HDFS repo to make the one-liner change from "new CRC32" to "new PureJavaCrc32". Should we move this JIRA back to Common and add a HDFS JIRA for the oneliner patch? Not sure how pedantic we're supposed to be about which JIRAs and patches apply where
          Hide
          Scott Carey added a comment -

          The seed indicated in the comments doesn't line up with the seed in the code. Can you please explain this?

          Ah, I forgot that detail. So, the seed in the comment was from the pkzip description of the CRC32 used there. But the lookup tables being built here used a different representation of the same polynomial to generate the single byte crc.

          The T1 for this algorithm differs from the TABLE in the older versions because it operates on the bit-flipped crc and then requires a bit-flip to return the result. This, combined with using code that shifts to the right in generation:

                
          for (int j = 8; j > 0; j--) {
            if ((crc & 0x1) == 0x1) {
              crc = seed ^ (crc >>> 1);
            } else {
              crc = crc >>> 1;
            }    
          }
          

          requires switching around the seed representation.

          I believe the eventual one I used is a combination of byte order, reversal, and perhaps bit-flip of the other one. I got it via trial and error looking for the first 4 entries in the map that I expected rather than any proof.

          Show
          Scott Carey added a comment - The seed indicated in the comments doesn't line up with the seed in the code. Can you please explain this? Ah, I forgot that detail. So, the seed in the comment was from the pkzip description of the CRC32 used there. But the lookup tables being built here used a different representation of the same polynomial to generate the single byte crc. The T1 for this algorithm differs from the TABLE in the older versions because it operates on the bit-flipped crc and then requires a bit-flip to return the result. This, combined with using code that shifts to the right in generation: for ( int j = 8; j > 0; j--) { if ((crc & 0x1) == 0x1) { crc = seed ^ (crc >>> 1); } else { crc = crc >>> 1; } } requires switching around the seed representation. I believe the eventual one I used is a combination of byte order, reversal, and perhaps bit-flip of the other one. I got it via trial and error looking for the first 4 entries in the map that I expected rather than any proof.
          Hide
          Todd Lipcon added a comment -

          Thanks for the clarification, Scott.

          I believe this patch is ready to go. Any objections to commit? After commit I'll follow up with an HDFS JIRA.

          Show
          Todd Lipcon added a comment - Thanks for the clarification, Scott. I believe this patch is ready to go. Any objections to commit? After commit I'll follow up with an HDFS JIRA.
          Hide
          Raghu Angadi added a comment -

          +1 for committing. pretty significant improvement in CPU.

          Show
          Raghu Angadi added a comment - +1 for committing. pretty significant improvement in CPU.
          Hide
          dhruba borthakur added a comment -

          This looks good. Does it make sense to run the unit test for longer period of time (more than the current 24 iterations) just so as to get more coverage?

          Is this new code written from scratch or has it been taken from some Apache project (just making sure that there aren't any GPL/LGPL issues)

          Show
          dhruba borthakur added a comment - This looks good. Does it make sense to run the unit test for longer period of time (more than the current 24 iterations) just so as to get more coverage? Is this new code written from scratch or has it been taken from some Apache project (just making sure that there aren't any GPL/LGPL issues)
          Hide
          Todd Lipcon added a comment -

          Dhruba: During testing I ran on many more iterations than just the 24. Whether it lives in the unit test with more or less I don't have a big opinion on. Seems to me that it should be fast enough that we can run a few thousand with no noticeable test performance degradation.

          As for the code, I believe Scott wrote it with some guidance from some C source, but it's definitely not ripped from another project - I googled far and wide for a good implementation in Java and didn't find any that were this fast. I think we now have the fastest crc32 in the west

          Show
          Todd Lipcon added a comment - Dhruba: During testing I ran on many more iterations than just the 24. Whether it lives in the unit test with more or less I don't have a big opinion on. Seems to me that it should be fast enough that we can run a few thousand with no noticeable test performance degradation. As for the code, I believe Scott wrote it with some guidance from some C source, but it's definitely not ripped from another project - I googled far and wide for a good implementation in Java and didn't find any that were this fast. I think we now have the fastest crc32 in the west
          Hide
          dhruba borthakur added a comment -

          > During testing I ran on many more iterations than just the 24. W

          Thanks for the info.

          > I think we now have the fastest crc32 in the wes

          Way to go!

          Show
          dhruba borthakur added a comment - > During testing I ran on many more iterations than just the 24. W Thanks for the info. > I think we now have the fastest crc32 in the wes Way to go!
          Hide
          Scott Carey added a comment -

          Is this new code written from scratch or has it been taken from some Apache project (just making sure that there aren't any GPL/LGPL issues)

          All the versions I wrote were from scratch, modifications from Todd's first version posted. That one uses a slight modification of the ubiquitous one-byte-at-a-time CRC32 algorithm. There are at lest a few dozen variants of this algorithm available in any license you wish, including public domain.

          Less common 4 and 8 byte at a time algorithms are out there. The one that inspired me was Intel's "slicing-by-8" (BSD license) C implementation here http://sourceforge.net/projects/slicing-by-8 From this IEEE paper (That I did not read, I don't have access to the publication) http://www2.computer.org/portal/web/csdl/doi/10.1109/TC.2008.85
          I did not copy any of that code (or build or execute it), it was conceptual reference however.

          Doing some searching on google today:
          I can't find a single implementation of a four-bytes-at-a-time approach in java.
          The four bytes at a time approach is not new, you can find it elsewhere, coded in C:
          Sun (CDDL)
          http://www.google.com/codesearch/p?hl=en&sa=N&cd=7&ct=rc#KSEQAa6EaaE/net-virt-src-20070216/usr/src/uts/common/inet/sctp_crc32.c&q=file:crc32%20four%20bytes
          A common zlib variant (at least 15 copies of this in the first few pages of 'file:crc32' search:
          http://www.google.com/codesearch/p?hl=en&sa=N&cd=9&ct=rc#5Y_n4SyFm44/digest/src/crc32.c&q=file:crc32%20four%20bytes

          I also just now found an implementation that uses the same same seed I did (but one byte at a time, public domain with copyright on the source).
          http://c.snippets.org/snip_lister.php?fname=crc_32.c

          So, all things considered above I don't think there are intellectual property issues. The idea came from a description of a whitepaper (but not the actual whitepaper) from Intel that has example BSD licensed source code in C where the actual algorithm is significantly different from mine, but demonstrates working more than one byte at a time with multiple lookup tables. Four bytes at a time algorithms exist in C code under CDDL and zlib, but differ a lot from this one. None of the above source was copied.
          Furthermore, there is no java code like it that I can find anywhere.

          Show
          Scott Carey added a comment - Is this new code written from scratch or has it been taken from some Apache project (just making sure that there aren't any GPL/LGPL issues) All the versions I wrote were from scratch, modifications from Todd's first version posted. That one uses a slight modification of the ubiquitous one-byte-at-a-time CRC32 algorithm. There are at lest a few dozen variants of this algorithm available in any license you wish, including public domain. Less common 4 and 8 byte at a time algorithms are out there. The one that inspired me was Intel's "slicing-by-8" (BSD license) C implementation here http://sourceforge.net/projects/slicing-by-8 From this IEEE paper (That I did not read, I don't have access to the publication) http://www2.computer.org/portal/web/csdl/doi/10.1109/TC.2008.85 I did not copy any of that code (or build or execute it), it was conceptual reference however. Doing some searching on google today: I can't find a single implementation of a four-bytes-at-a-time approach in java. The four bytes at a time approach is not new, you can find it elsewhere, coded in C: Sun (CDDL) http://www.google.com/codesearch/p?hl=en&sa=N&cd=7&ct=rc#KSEQAa6EaaE/net-virt-src-20070216/usr/src/uts/common/inet/sctp_crc32.c&q=file:crc32%20four%20bytes A common zlib variant (at least 15 copies of this in the first few pages of 'file:crc32' search: http://www.google.com/codesearch/p?hl=en&sa=N&cd=9&ct=rc#5Y_n4SyFm44/digest/src/crc32.c&q=file:crc32%20four%20bytes I also just now found an implementation that uses the same same seed I did (but one byte at a time, public domain with copyright on the source). http://c.snippets.org/snip_lister.php?fname=crc_32.c So, all things considered above I don't think there are intellectual property issues. The idea came from a description of a whitepaper (but not the actual whitepaper) from Intel that has example BSD licensed source code in C where the actual algorithm is significantly different from mine, but demonstrates working more than one byte at a time with multiple lookup tables. Four bytes at a time algorithms exist in C code under CDDL and zlib, but differ a lot from this one. None of the above source was copied. Furthermore, there is no java code like it that I can find anywhere.
          Hide
          Scott Carey added a comment -

          I think we now have the fastest crc32 in the west

          There are two ways to still go faster:

          1. an 8-byte at a time approach based off the Intel paper.
          2. Being able to read 4 consecutive bytes from a byte array without shifting each separately. Sort of like ByteBuffer.getInt(). However, I tried ByteBuffers and they were substantially slower if allocated via ByteBuffer.wrap() in the API here. An API change with a preallocated one might work though.

          Honestly, I just wish the JVM had a cleaner way to pull out Integers/Longs from byte arrays that it could optimize to be like casting a pointer in C and the JVM could optimize if the load was aligned. It would make a lot of low level byte-wise algorithms in Java just as fast as C.

          Show
          Scott Carey added a comment - I think we now have the fastest crc32 in the west There are two ways to still go faster: 1. an 8-byte at a time approach based off the Intel paper. 2. Being able to read 4 consecutive bytes from a byte array without shifting each separately. Sort of like ByteBuffer.getInt(). However, I tried ByteBuffers and they were substantially slower if allocated via ByteBuffer.wrap() in the API here. An API change with a preallocated one might work though. Honestly, I just wish the JVM had a cleaner way to pull out Integers/Longs from byte arrays that it could optimize to be like casting a pointer in C and the JVM could optimize if the load was aligned. It would make a lot of low level byte-wise algorithms in Java just as fast as C.
          Hide
          Todd Lipcon added a comment -

          Moved this back to Common since this patch is for util. Just to make sure, I ran the current patch with 1M iterations instead of just 24 and it still passed (but took 44 seconds).

          I believe the IP issues look to be clean, so this should be ready for commit.

          Show
          Todd Lipcon added a comment - Moved this back to Common since this patch is for util. Just to make sure, I ran the current patch with 1M iterations instead of just 24 and it still passed (but took 44 seconds). I believe the IP issues look to be clean, so this should be ready for commit.
          Hide
          Raghu Angadi added a comment -

          Scott and Todd, any plans to inform OpenJDK guys about this?

          Show
          Raghu Angadi added a comment - Scott and Todd, any plans to inform OpenJDK guys about this?
          Hide
          Tsz Wo Nicholas Sze added a comment -

          I think the codes could be improved a little bit as following.

              public void update(byte[] b, int off, final int len) {
                for(final int end = off + (len & ~3); off < end; ) {
                  final int c0 = crc ^ b[off++];
                  final int c1 = (crc >>>= 8) ^ b[off++];
                  final int c2 = (crc >>>= 8) ^ b[off++];
                  final int c3 = (crc >>>= 8) ^ b[off++];
                  crc = T4[c0 & 0xff] ^ T3[c1 & 0xff] ^ T2[c2 & 0xff] ^ T1[c3 & 0xff];
                }
                
                for(final int end = off + (len & 3); off < end; ) {
                  crc = (crc >>> 8) ^ T1[(crc ^ b[off++]) & 0xff];
                }
              }
          

          PureJavaCrc32.java: I moved the codes around and did some benchmarks.

          Show
          Tsz Wo Nicholas Sze added a comment - I think the codes could be improved a little bit as following. public void update( byte [] b, int off, final int len) { for ( final int end = off + (len & ~3); off < end; ) { final int c0 = crc ^ b[off++]; final int c1 = (crc >>>= 8) ^ b[off++]; final int c2 = (crc >>>= 8) ^ b[off++]; final int c3 = (crc >>>= 8) ^ b[off++]; crc = T4[c0 & 0xff] ^ T3[c1 & 0xff] ^ T2[c2 & 0xff] ^ T1[c3 & 0xff]; } for ( final int end = off + (len & 3); off < end; ) { crc = (crc >>> 8) ^ T1[(crc ^ b[off++]) & 0xff]; } } PureJavaCrc32.java: I moved the codes around and did some benchmarks.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          benchmarks20090714.txt: benchmarks results

          Show
          Tsz Wo Nicholas Sze added a comment - benchmarks20090714.txt: benchmarks results
          Hide
          Tsz Wo Nicholas Sze added a comment -

          Changed the benchmark as below such that there are ~4GB data in each run.

              for(int j = 10; j < 24; j += 2) {
                for(int k = 0; k < 4; k++) {
                  final int bytelen = (1 << j) + k;
                  final byte[] b = new byte[bytelen];
                  final int n = (int)((1L << 32) / bytelen);
              
                  ran.nextBytes(b);
                  t.tick("ran.nextBytes, bytelen=" + bytelen);
              
                  final SortedMap<Long, Checksum> rank = new TreeMap<Long, Checksum>(); 
                  test(pure, b, n, t, rank);
                  test(test, b, n, t, rank);
                  test(zip, b, n, t, rank);
          
                  System.out.println("rank = " + rank);
                  final Checksum c = rank.entrySet().iterator().next().getValue();
                  fastest.put(c, fastest.get(c) + 1);
                }
              }
          

          benchmarks20090715.txt: new results

          It is consistent that TestCrc32 is faster than zip.CRC32, which is faster than PureJavaCrc32. There are ~13% improvement by TestCrc32 over PureJavaCrc32.

          Show
          Tsz Wo Nicholas Sze added a comment - Changed the benchmark as below such that there are ~4GB data in each run. for ( int j = 10; j < 24; j += 2) { for ( int k = 0; k < 4; k++) { final int bytelen = (1 << j) + k; final byte [] b = new byte [bytelen]; final int n = ( int )((1L << 32) / bytelen); ran.nextBytes(b); t.tick( "ran.nextBytes, bytelen=" + bytelen); final SortedMap< Long , Checksum> rank = new TreeMap< Long , Checksum>(); test(pure, b, n, t, rank); test(test, b, n, t, rank); test(zip, b, n, t, rank); System .out.println( "rank = " + rank); final Checksum c = rank.entrySet().iterator().next().getValue(); fastest.put(c, fastest.get(c) + 1); } } benchmarks20090715.txt: new results It is consistent that TestCrc32 is faster than zip.CRC32, which is faster than PureJavaCrc32. There are ~13% improvement by TestCrc32 over PureJavaCrc32.
          Hide
          Todd Lipcon added a comment -

          Thanks for the improvements, Nicholas. I should have a chance to verify your findings today or tomorrow.

          At first glance, it seems odd that zip.CRC32 is faster than PureJavaCrc32. Are you using the Sun JDK or OpenJDK? In my tests for large block sizes, Sun's JDK is slower than PureJavaCrc32, but OpenJDK's is faster.

          Show
          Todd Lipcon added a comment - Thanks for the improvements, Nicholas. I should have a chance to verify your findings today or tomorrow. At first glance, it seems odd that zip.CRC32 is faster than PureJavaCrc32. Are you using the Sun JDK or OpenJDK? In my tests for large block sizes, Sun's JDK is slower than PureJavaCrc32, but OpenJDK's is faster.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          I am using Sun JDK on Windows.

          bash-3.2$ java -version
          java version "1.6.0_13"
          Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
          Java HotSpot(TM) Client VM (build 11.3-b02, mixed mode, sharing)
          
          Show
          Tsz Wo Nicholas Sze added a comment - I am using Sun JDK on Windows. bash-3.2$ java -version java version "1.6.0_13" Java(TM) SE Runtime Environment (build 1.6.0_13-b03) Java HotSpot(TM) Client VM (build 11.3-b02, mixed mode, sharing)
          Hide
          Scott Carey added a comment -

          Whatever that Timer class is that is being used to measure, it is using the system millisecond time, which has only ~15ms accuracy so the test run time needs to be long to be accurate. To get more accurate results, use System.nanoTime().
          Also, I have found that a couple 'warmup' iterations of a test make the results much more consistent, to avoid the JIT interfering.

          Using the benchmark I did before (previously attached, TestCrc32Performance.java), the new version is consistently 10% slower than the previous on my machine (Java 6, Max OS X, Core2 Duo processor, 64 bit JVM). On Sun JRE 6.0_u14 on Linux (64 bit) with different CPUs, results vary. I'll dig into those details below.

          Results should be normalized to a metric we can compare to – we have been using MB/sec so far. Additionally, the JVM and environment used is critical. Java 1.5 behaves VERY differently and I would expect changes like this to behave differently there (as well as with OpenJDK, IBM, or JRockit).

          There are two changes here – the loop format and termination, and how the shifts and masks are packed.

          Adding or removing the final declarations makes no difference in my testing – the compiler easily identifies whether variables change or not.

          These two changes, taken alone, or taken together have varied results. This is probably because loop unrolling in the JIT behaves differently. Generally, the loop change helped the least, it only avoids a decrement on a loop condition, which is essentially free for some processors, and it has higher start-up cost and more variables.

          Here are results on my laptop, followed by two different servers.

          Note, that the first two results in any test are always lower than the rest, regardless of what order i do the tests, so consider the 'size 1' scores suspect.

          Here are my results with the new version posted on my laptop Core2 Duo – 2.5Ghz, 4GB 667Mhz DDR2 SDRAM, OSX 10.5.7 Java 6 (Apple):

          $ java -Xmx512m -cp testall.jar org.apache.hadoop.util.TestCrc32Performance

          num bytes NewLoopOnly MB/sec NewInnerOnly MB/sec NewPureJava MB/sec PureJava MB/sec Native MB/sec
          1 85.195 94.160 121.054 117.785 6.580
          2 123.629 133.855 131.112 164.309 12.354
          4 226.677 220.528 194.145 287.520 24.309
          8 240.169 262.491 253.665 283.975 45.353
          16 343.329 364.299 354.749 383.966 77.207
          32 441.347 445.800 433.381 462.508 122.829
          64 522.195 522.210 502.894 528.559 188.184
          128 570.476 551.149 541.540 555.572 194.542
          256 596.147 577.042 558.884 591.713 289.183
          512 601.956 583.593 561.882 593.323 315.714
          1024 623.217 592.406 577.292 603.304 332.319
          2048 623.979 594.163 581.419 606.365 341.448
          4096 624.345 596.289 584.365 610.018 344.685
          8192 626.711 593.323 585.424 607.542 347.891
          16384 625.938 599.650 584.414 607.903 350.221
          32768 623.995 583.516 577.771 609.930 349.754
          65536 623.906 594.321 578.602 610.338 347.915
          131072 624.308 595.950 577.024 610.308 350.647
          262144 629.946 590.831 577.453 610.285 351.603
          524288 623.757 597.578 575.428 610.399 349.063
          1048576 624.043 596.817 577.521 610.798 352.213
          2097152 627.303 591.817 573.852 610.981 352.406
          4194304 623.168 593.048 570.512 605.679 347.898
          8388608 609.118 587.879 562.892 600.384 344.380
          16777216 610.116 585.480 555.988 601.001 348.796

          For the above, the new loop helps, the new inner code hurts, and combined it is worst. For small checksum sizes all are worse.

          For a Linux server (Dual quad core Xeon E5335 @ 2.00GHz), Sun JDK 1.6.0_u14 the results are different.
          The Loop change does not help, changing the inner code alone helps the most, and combining the two is somewhere in-between, and for very small sizes its always slower:

          java -Xmx512m -XX:+UseParallelOldGC -XX:+UseCompressedOops -cp testall.jar org.apache.hadoop.util.TestCrc32Performance

          num bytes NewLoopOnly MB/sec NewInnerOnly MB/sec NewPureJava MB/sec PureJava MB/sec Native MB/sec
          1 66.281 66.264 85.825 93.888 7.331
          2 93.812 94.895 92.638 129.917 13.931
          4 155.431 144.819 161.540 178.275 26.586
          8 174.531 185.222 206.577 185.253 47.892
          16 244.490 275.026 292.237 256.664 81.922
          32 306.768 351.271 345.710 313.523 127.560
          64 350.763 407.539 382.370 352.997 175.000
          128 377.402 442.239 406.650 376.721 218.138
          256 392.901 461.884 420.526 390.215 247.586
          512 397.682 467.375 425.522 393.319 264.732
          1024 404.145 474.192 430.791 391.678 272.883
          2048 407.673 478.888 433.719 400.236 277.638
          4096 409.028 480.907 435.645 401.338 280.214
          8192 409.890 482.769 436.209 402.037 281.782
          16384 409.565 482.726 436.340 401.767 282.409
          32768 407.943 481.176 436.373 399.955 282.455
          65536 407.933 481.746 435.970 400.228 282.761
          131072 408.003 481.723 436.516 399.973 282.749
          262144 407.412 481.357 436.067 400.962 283.193
          524288 408.077 481.335 436.416 401.280 283.137
          1048576 408.016 481.625 436.086 402.039 283.308
          2097152 407.397 481.386 436.131 401.394 283.353
          4194304 406.609 479.130 434.960 400.632 282.376
          8388608 403.235 475.130 430.770 397.797 280.904
          16777216 402.891 474.464 430.427 397.324 280.951

          Lastly, I have access to a dual quad core Nehalem system Xeon X5550 @ 2.67GHz. The trend is similar to the other Linux server.

          num bytes NewLoopOnly MB/sec NewInnerOnly MB/sec NewPureJava MB/sec PureJava MB/sec Native MB/sec
          1 100.809 105.911 124.574 168.577 17.863
          2 144.671 157.576 150.993 203.188 33.722
          4 230.455 226.380 250.517 276.617 64.370
          8 263.961 288.727 308.807 292.203 103.247
          16 345.860 419.419 454.893 394.811 158.828
          32 437.694 536.016 534.576 470.360 226.527
          64 503.768 611.457 579.010 470.782 281.814
          128 537.179 650.089 614.528 522.099 317.293
          256 557.598 672.534 630.303 558.545 344.511
          512 561.642 677.876 611.557 463.531 345.938
          1024 565.044 690.742 615.121 558.021 367.634
          2048 583.641 714.832 634.747 602.027 372.008
          4096 594.691 717.332 643.951 598.466 369.619
          8192 562.813 639.987 596.214 560.489 366.227
          16384 540.490 682.933 481.701 571.237 373.531
          32768 520.446 636.502 563.346 539.832 333.878
          65536 498.553 617.019 556.558 528.404 343.788
          131072 526.779 625.905 585.023 520.933 333.155
          262144 535.428 617.910 568.184 530.084 337.989
          524288 563.442 637.368 623.922 578.144 379.028
          1048576 578.139 709.259 632.558 596.539 374.711
          2097152 537.883 662.951 634.672 583.292 363.072
          4194304 551.819 677.231 632.805 580.628 372.185
          8388608 581.055 689.236 624.939 584.958 358.719
          16777216 563.078 675.309 566.319 579.324 365.306

          So it looks like the winner is to change the inner part of the loop only. Although this hurts a bit on Apple's VM, that isn't a production VM. Unless there is data for some other VM's that suggests this is a poor choice it looks best to me. It is slightly slower for sizes less than about 8 bytes however The code is:

            public void update(byte[] b, int off, int len) {
              while(len > 3) {
                int c0 = crc ^ b[off++];
                int c1 = (crc >>>= 8) ^ b[off++];
                int c2 = (crc >>>= 8) ^ b[off++];
                int c3 = (crc >>>= 8) ^ b[off++];
                crc = T4[c0 & 0xff] ^ T3[c1 & 0xff] ^ T2[c2 & 0xff] ^ T1[c3 & 0xff];
                len -=4;
              }
              while(len > 0) {
                crc = (crc >>> 8) ^ T1[(crc ^ b[off++]) & 0xff];
                len --;
              }
            }
          Scott Carey added a comment -

          Attached: variations of the CRC implementation for testing.

          Scott Carey added a comment -

          New test, testing the performance of many CRC variants and spending more time warming up each of them to make sure the JIT has compiled them to native code. System.gc() calls were experimented with to make results more consistent from run to run. Typically, I have been running this with -Xmx512m.
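
As an illustration of the warm-up-then-measure pattern described above, a minimal harness might look like the following. This is a hypothetical sketch, not the attached TestCrc32Performance.java:

import java.util.zip.Checksum;

public class CrcBench {
  static double mbPerSec(Checksum crc, byte[] data, int chunk) {
    // Warm-up passes give the JIT a chance to compile the hot loop before timing.
    for (int i = 0; i < 3; i++) {
      run(crc, data, chunk);
    }
    long start = System.nanoTime(); // nanoTime(), not millisecond-granularity timers
    long bytes = run(crc, data, chunk);
    double secs = (System.nanoTime() - start) / 1e9;
    return bytes / secs / (1024 * 1024);
  }

  static long run(Checksum crc, byte[] data, int chunk) {
    long bytes = 0;
    for (int rep = 0; rep < 100; rep++) {
      for (int off = 0; off + chunk <= data.length; off += chunk) {
        crc.reset();
        crc.update(data, off, chunk);
        bytes += chunk;
      }
    }
    return bytes;
  }
}

It would be called as, for example, mbPerSec(new PureJavaCrc32(), data, 512) for each chunk size of interest.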

          Scott Carey added a comment -

          > I am using Sun JDK on Windows.
          >
          > bash-3.2$ java -version
          > java version "1.6.0_13"
          > Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
          > Java HotSpot(TM) Client VM (build 11.3-b02, mixed mode, sharing)

          You are using the Client VM, not the Server VM. Try some tests with the -server flag.
          You are also using a 32-bit VM, which may have a small effect here, but the Server versus Client VM makes a big difference here.
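
A benchmark can also record which VM it actually ran under. The following sketch uses standard system properties (java.vm.name is standard; sun.arch.data.model is Sun-JVM-specific):

public class VmInfo {
  public static void main(String[] args) {
    // e.g. "Java HotSpot(TM) Server VM" vs. "Java HotSpot(TM) Client VM"
    System.out.println(System.getProperty("java.vm.name"));
    // "32" or "64" on Sun JVMs
    System.out.println(System.getProperty("sun.arch.data.model"));
  }
}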

          Todd Lipcon added a comment -

          Here are the results from my laptop. PureJavaTW is Tsz Wo's updated version.

          First, 64-bit JVM:

          num bytes NewLoopOnly MB/sec NewInnerOnly MB/sec NewPureJava MB/sec PureJava MB/sec PureJavaTW MB/sec Native MB/sec
          1 79.721 87.113 111.889 112.642 88.684 9.001
          2 91.820 130.441 130.855 133.645 131.948 17.536
          4 167.315 223.087 198.984 243.936 198.387 33.488
          8 188.153 254.671 243.899 248.944 242.746 59.258
          16 311.118 353.191 350.826 327.231 333.589 99.213
          32 401.696 416.339 417.406 427.676 408.610 148.742
          64 499.747 483.148 445.437 472.814 487.970 208.478
          128 530.055 520.043 505.709 513.645 457.080 253.756
          256 489.407 541.459 523.867 547.794 510.867 283.360
          512 561.871 528.383 528.368 553.071 530.421 300.134
          1024 579.227 549.401 537.941 551.488 536.391 293.307
          2048 586.443 551.685 540.289 564.328 539.412 319.766
          4096 608.470 573.333 560.746 586.014 560.349 332.661
          8192 590.123 554.975 543.089 424.834 517.325 322.192
          16384 583.385 539.704 542.656 567.484 542.567 324.026
          32768 583.592 551.508 533.585 561.559 529.811 321.858
          65536 584.476 553.217 537.679 544.507 512.978 310.739
          131072 548.941 529.097 534.430 564.858 533.955 324.626
          262144 584.379 551.733 536.386 564.063 479.038 324.117
          524288 583.262 553.893 536.770 563.518 532.404 324.924
          1048576 581.947 550.572 533.850 561.049 512.846 294.452
          2097152 566.543 534.695 484.256 551.693 527.730 320.850
          4194304 569.545 537.748 520.731 547.608 522.762 318.084
          8388608 593.932 563.233 530.600 571.098 545.905 310.115
          16777216 573.095 560.361 545.069 576.036 529.475 331.984

          And 32-bit JVM on the same machine:

          num bytes NewLoopOnly MB/sec NewInnerOnly MB/sec NewPureJava MB/sec PureJava MB/sec PureJavaTW MB/sec Native MB/sec
          1 56.832 51.138 60.116 60.826 47.624 7.612
          2 84.111 82.710 83.672 81.422 82.630 15.075
          4 167.204 138.900 147.189 163.798 147.762 30.063
          8 175.720 177.396 184.865 180.241 186.981 54.772
          16 227.769 257.112 245.795 247.928 245.016 94.794
          32 279.098 336.251 267.332 298.998 286.251 147.808
          64 304.806 381.515 321.722 339.217 318.546 207.577
          128 316.745 433.474 339.159 356.571 337.731 255.225
          256 331.208 454.923 315.551 374.849 342.356 291.948
          512 333.462 451.160 351.006 374.545 344.597 312.074
          1024 332.338 462.433 350.361 375.159 359.086 324.510
          2048 329.805 472.755 361.305 379.140 338.142 317.101
          4096 326.613 466.729 349.653 345.593 337.487 317.728
          8192 313.368 458.838 357.077 384.962 348.174 332.353
          16384 338.060 448.738 341.744 371.819 337.915 295.606
          32768 301.724 451.107 346.410 381.720 322.610 332.183
          65536 337.599 472.797 328.112 383.809 324.456 336.955
          131072 338.017 471.498 351.307 383.234 350.165 338.728
          262144 338.063 472.881 351.411 383.875 350.652 338.079
          524288 338.175 471.574 349.477 381.680 348.984 334.452
          1048576 333.453 460.829 343.706 381.459 346.565 334.384
          2097152 335.291 465.923 347.260 374.896 330.102 330.254
          4194304 332.436 460.711 340.488 378.804 346.389 334.329
          8388608 334.700 464.714 347.837 378.336 346.230 324.550
          16777216 316.521 431.736 342.807 373.638 341.928 328.748

          It seems to me that, on the 64-bit JVM, most of the implementations are within the margin of error at the sizes that are most often exercised (128 to 256 bytes). On 32-bit, NewInnerOnly wins by a reasonable amount.

          I say we commit the NewInnerOnly version.

          Tsz Wo Nicholas Sze added a comment -

          Thanks, Scott and Todd, for checking the performance in such great detail!

          > I say we commit the NewInnerOnly version.
          +1

          BTW, the path for TestPureJavaCrc32 is wrong in the previous patch. We don't have a "core" directory anymore.

          --- /dev/null
          +++ src/test/core/org/apache/hadoop/util/TestPureJavaCrc32.java
          @@ -0,0 +1,94 @@
          
          Todd Lipcon added a comment -

          Good catch on the patch. I'll prepare a new one shortly that fixes that and includes the NewInnerOnly implementation, unless you're already on it.

          Tsz Wo Nicholas Sze added a comment -

          Since the benchmark program is very useful, we may combine TestCrc32Performance with TestPureJavaCrc32. Then, the benchmark can be executed by something like "java TestPureJavaCrc32" in the future.

          Todd, could you post a new patch?

          Scott Carey added a comment -

          Interesting results.
          Todd, were you using -server on the 32-bit version? If it's Windows, client is the default; if it's Linux, server is; and on a Mac, 32-bit means Java 1.5.

          In this benchmark, the Mac Java 1.5 32-bit JVM has equal performance between the previous version and the InnerOnly version, but it is slower than native except for sizes 128 and under, so it's a somewhat useless test for the expected usage of this code.

          I wasn't able to run against the newest 32-bit Sun JVM on a Linux server or to try out -client. There is no -client for the 64-bit Sun JVMs.

          Scott Carey added a comment -

          > It seems to me that, on the 64-bit JVM, most of the implementations are within the margin of error at the sizes that are most often exercised (128 to 256 bytes).

          What are the most common use cases, and where else should this code be used other than HDFS? For HDFS, the default checksum block size is 512 bytes. The bzip2 code uses its own CRC32; perhaps that should change. For .zip file compression or decompression, I'm not sure what the typical use case is.
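
For context on the 512-byte figure: in HDFS the per-checksum chunk size comes from the io.bytes.per.checksum configuration key. A minimal lookup might read it as follows (illustrative snippet; the surrounding class is hypothetical):

import org.apache.hadoop.conf.Configuration;

public class ChecksumChunkSize {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Defaults to 512: a CRC is computed over each 512-byte chunk of data.
    int bytesPerChecksum = conf.getInt("io.bytes.per.checksum", 512);
    System.out.println("CRC computed per " + bytesPerChecksum + "-byte chunk");
  }
}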

          Owen O'Malley added a comment -

          Since map/reduce uses CRC to check the spills, the chunks can be of arbitrary length.
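
Arbitrary lengths are safe because java.util.zip.Checksum implementations are incremental: feeding a buffer in pieces of any size yields the same value as a single pass. A quick sketch demonstrating this with the JDK CRC32:

import java.util.Random;
import java.util.zip.CRC32;

public class IncrementalCrc {
  public static void main(String[] args) {
    byte[] data = new byte[1000];
    new Random(42).nextBytes(data);

    CRC32 whole = new CRC32();
    whole.update(data, 0, data.length);

    CRC32 pieces = new CRC32();
    pieces.update(data, 0, 37);                 // arbitrary split point
    pieces.update(data, 37, data.length - 37);

    System.out.println(whole.getValue() == pieces.getValue()); // prints true
  }
}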

          Todd Lipcon added a comment -

          Attached is an updated patch. The changes here from the previous version:

          • Replaced the implementation with the NewInner version by Scott.
          • Cleaned up the Performance Test a bit and included it as a static inner class of TestPureJavaCrc32 - it can be run using java 'org.apache.hadoop.util.TestPureJavaCrc32$PerformanceTest'
          • The patch also includes changes to ChecksumFileSystem and util.DataChecksum

          Note that I did not move the test out of src/test/core; it looks like src/test/core still exists in the common repository, and all of the tests for common are still in there.

          As for testing, I reran the unit test, temporarily modified to do several million trials, and it passed. I also ran TestChecksumFileSystem and it passed. I re-ran the performance test and made sure it was consistent with the results we've been seeing all along.
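
A minimal version of that kind of randomized cross-check against java.util.zip.CRC32 might look like this (illustrative sketch, not the attached TestPureJavaCrc32):

import java.util.Random;
import java.util.zip.CRC32;
import java.util.zip.Checksum;
import org.apache.hadoop.util.PureJavaCrc32;

public class CrossCheck {
  public static void main(String[] args) {
    Random r = new Random();
    byte[] buf = new byte[4096];
    for (int trial = 0; trial < 1000000; trial++) {
      r.nextBytes(buf);
      // Exercise random offsets and lengths, not just whole buffers.
      int off = r.nextInt(buf.length);
      int len = r.nextInt(buf.length - off);
      Checksum expected = new CRC32();
      Checksum actual = new PureJavaCrc32();
      expected.update(buf, off, len);
      actual.update(buf, off, len);
      if (expected.getValue() != actual.getValue()) {
        throw new AssertionError("mismatch at trial " + trial);
      }
    }
  }
}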

          After this is committed, there will be a small patch to HDFS and a small patch to MapReduce to substitute PureJavaCrc32 for java.util.zip.CRC32.
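
Since PureJavaCrc32 implements the java.util.zip.Checksum interface, each of those follow-up patches amounts to a one-line change at the construction site. A hypothetical call site, for illustration:

import java.util.zip.Checksum;
import org.apache.hadoop.util.PureJavaCrc32;

public class SwapExample {
  public static long checksum(byte[] buffer) {
    Checksum sum = new PureJavaCrc32(); // was: new java.util.zip.CRC32()
    sum.update(buffer, 0, buffer.length);
    return sum.getValue();
  }
}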

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12413728/hadoop-6148.txt
          against trunk revision 793987.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/577/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/577/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/577/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/577/console

          This message is automatically generated.

          Todd Lipcon added a comment -

          Attached is a slightly updated patch which fixes the javadoc warning (there was an @inheritDoc on a constructor, which isn't allowed).

          Tsz Wo Nicholas Sze added a comment -
          diff --git a/src/java/org/apache/hadoop/fs/ChecksumFileSystem.java b/src/java/...
          index 72a09bd..6f9701e 100644
          --- a/src/java/org/apache/hadoop/fs/ChecksumFileSystem.java
          +++ b/src/java/org/apache/hadoop/fs/ChecksumFileSystem.java
          

          There are "a" and "b" directories in the patch. Hudson probably cannot handle it.

          +1 Patch looks good other than that.

          > BTW, the path for TestPureJavaCrc32 is wrong in the previous patch. We don't have a "core" directory anymore.
          I was wrong about it. We still have "core" for the tests. See also HADOOP-6159.

          Todd Lipcon added a comment -

          Whoops! I forgot to pass --no-prefix to my git-diff. Same patch, just without the a/ and b/ prefixes.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12413763/hadoop-6148.txt
          against trunk revision 793987.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/578/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/578/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/578/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/578/console

          This message is automatically generated.

          Tsz Wo Nicholas Sze added a comment -

          I have committed this. Thanks, Todd and Scott!

          Hudson added a comment -

          Integrated in Hadoop-Common-trunk #29 (see http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk/29/): Implement a fast, pure Java CRC32 calculator which outperforms java.util.zip.CRC32. Contributed by Todd Lipcon and Scott Carey.

          Tsz Wo Nicholas Sze added a comment -

          Got some ideas to improve PureJavaCrc32. See HADOOP-6166.


            People

            • Assignee: Scott Carey
            • Reporter: Owen O'Malley
            • Votes: 0
            • Watchers: 18
