|
[
Permlink
| « Hide
]
Tsz Wo (Nicholas), SZE added a comment - 23/Jul/09 01:43 AM
c6166_20090722.patch: Tried a few CRC32 implementations.
c6166_20090722_benchmark_64VM.txt, c6166_20090722_benchmark_32VM.txt: Tested the implementations on both 32-bit and 64-bit VMs. Seems that Crc32_4_3 are faster than current PureJavaCrc32 in both cases.
(Sorry that I still have not tested with TestPureJavaCrc32.PerformanceTest.) One could take Intel's idea, and go 8 (or 6 or 12?) bytes at a time instead of 4. This would line up with the 12 bit lookup tables a bit better. 6 bytes might be the easier boundary, and require 4 12 bit lookup tables. which is 32K, the size of the tables in the inner loop in your "4_3" are 17K, and the current version is 4K.
Going 12 bytes at a time would require 8 tables and 64K of space, and then we're randomly jumping around lookup tables that don't fit in a L1 D-cache on some processors. The 8 bytes at a time approach is in C code (BSD license) here: The trick with going over 4 bytes in a loop has to do with how the cycle works on the CRC. I think, after the first four bytes of lookups, it changes a bit. I think a 6 byte at a time, 4 lookup of 12 bit LUT would be like this: public class Crc32_6_4 extends Crc32Base { /** {@inheritDoc} */ public void update(byte[] b, int off, int len) { while(len > 3) { crc ^= b[off++] & 0xff; crc ^= (b[off++] & 0xff) << 8; crc ^= (b[off++] & 0xff) << 16; crc ^= (b[off++] & 0xff) << 24; int c0 = b[off++] & 0xff;; c0 ^= (b[off++] & 0xff) << 8; crc = T12_3[crc & 0xfff] ^ T12_2[(crc >>> 12) & 0xfff] ^ T12_1[((crc >>> 24) & (c0 << 8)) & 0xfff] ^ T12_0[c0 >> 4]; len -= 6; } for (; len > 0; len--) { crc = (crc >>> 8) ^ Table8_0[(crc ^ b[off++]) & 0xff]; } } Assuming the tables were built right and the "wrap past 4 bytes means don't xor with crc" is correct. I haven't tried this at all. All of these with larger lookup tables run the risk of performing worse under concurrency, even if faster single threaded. Cache pressure is greater under concurrency. We might want to use the benchmark in For reference, Intel's C code (referenced above, snippet below) with 8 tables looks like this in the inner loop: /*++ * * Copyright (c) 2004-2006 Intel Corporation - All Rights Reserved * * This software program is licensed subject to the BSD License, * available at http://www.opensource.org/licenses/bsd-license.html * * Abstract: The main routine * --*/ crc32c_sb8_64_bit( uint32_t* p_running_crc, const uint8_t* p_buf, const uint32_t length, const uint32_t init_bytes, uint8_t mode) { uint32_t li; uint32_t crc, term1, term2; uint32_t running_length; uint32_t end_bytes; if(mode == MODE_CONT) crc = *p_running_crc; else crc = CRC32C_INIT_REFLECTED; running_length = ((length - init_bytes)/8)*8; end_bytes = length - init_bytes - running_length; for(li=0; li < init_bytes; li++) crc = crc_tableil8_o32[(crc ^ *p_buf++) & 0x000000FF] ^ (crc >> 8); for(li=0; li < running_length/8; li++) { crc ^= *(uint32_t *)p_buf; p_buf += 4; term1 = crc_tableil8_o88[crc & 0x000000FF] ^ crc_tableil8_o80[(crc >> 8) & 0x000000FF]; term2 = crc >> 16; crc = term1 ^ crc_tableil8_o72[term2 & 0x000000FF] ^ crc_tableil8_o64[(term2 >> 8) & 0x000000FF]; term1 = crc_tableil8_o56[(*(uint32_t *)p_buf) & 0x000000FF] ^ crc_tableil8_o48[((*(uint32_t *)p_buf) >> 8) & 0x000000FF]; term2 = (*(uint32_t *)p_buf) >> 16; crc = crc ^ term1 ^ crc_tableil8_o40[term2 & 0x000000FF] ^ crc_tableil8_o32[(term2 >> 8) & 0x000000FF]; p_buf += 4; } for(li=0; li < end_bytes; li++) crc = crc_tableil8_o32[(crc ^ *p_buf++) & 0x000000FF] ^ (crc >> 8); if((mode == MODE_BEGIN) || (mode == MODE_CONT)) return crc; return crc ^ XOROT; } That is pretty straightforward to implement if the next 4 tables (T5, T6, T7, T8 in the terminology of the current PureJavaCRC32) are generated the same way as the previous 4. crc ^= *(uint32_t *)p_buf I may try out the eight bytes at once with 8 lookup tables version next week. Yeah, the speed depends on many platform dependent details like cache size, 32-bit/64-bit, etc. So reducing the number of operations in the CRC algorithm may not lead to a better performance.
I tried more varieties like Crc32_6_4 but my implementation did not perform well. We should run benchmark on Scott's. Unfortunately, Crc32_4_3 only wins on a 32-bit vm over TestPureJavaCrc32.PerformanceTest but not 64-bit vm.
Not yet able to improve PureJavaCrc32 in my 64-bit machine but had a lot of fun last weekends.
c6166_20090727.patch: moved the codes to common (finally). Please try it when you have time.
Got a different story after updated to the latest jdk.
java.version = 1.6.0_14
The above table makes more sense since it is easy to tell from the codes that Crc32_N_N for N > 4 is more efficient than PureJavaCrc32 (i.e. Crc32_4_4). Note that N cannot be increased arbitrary. Otherwise, the tables may not fit into the cpu cache as explained previously by Scott. (Tried Crc32_16_16 but it got worst.) As shown above, Crc32_12_12 has 7% and 26% improvement on my 64-bit and 32-bit machines with jdk 1.6.0_14-b08, respectively. I cannot explain why the numbers were generally better in 1.6.0_10-b33, 64-bit vm. Specific jdk feature/bug? For the 32 bit results, try passing -server on the command line. It behaves quite differently with loop unrolling and certain low level optimizations in the JIT versus -client (which is only default on 32 bit windows, and anyone who would run Hadoop there and wanted better performance would pass -server to speed it up).
Are you specifying a -Xmx memory value? What about -Xms? On windows with -client, the VM has unusual default memory and GC values, I've found that setting its NewRatio more like the other platforms helps a lot: -XX:NewRatio=4 or something like that may make your results more consistent across the platforms (and faster on 32 bit windows). On my environment, on the previous set of tests, changing from _10 to _12 to _14 on JDK6 did not seem to do much. But I was manually setting -Xmx512m for all of my tests. I can try again later, but there is something odd about the results slowing down so much on the 1.6.0_14 version. It is also curious that the PureJavaCrc32New – which only changes the loop style --also slows down but not as much as the older PureJavaCrc32 and goes from always about 15% slower to a little bit faster. My guess is something configuration related has changed with respect to some default JVM settings. I think there may be some improvement possible in the 8_8 case in how the 9 XORs at the end are done. Perhaps all in one line? or in 3 sets of 3? Or more likely the compiler is smart enough to do the register optimization itself? Perhaps not, Intel's C code even avoids a single line with more than 4 XORs at once for some reason. c6166_20090728.patch: included Crc32_16_16
> For the 32 bit results, try passing -server on the command line. ... java.version = 1.6.0_14
> Are you specifying a -Xmx memory value? What about -Xms? > It is also curious that the PureJavaCrc32New - which only changes the loop style ... > I think there may be some improvement possible in the 8_8 case in how the 9 XORs at the end are done. ... Thanks, Scott. >> I think there may be some improvement possible in the 8_8 case in how the 9 XORs at the end are done. ...
>Yeah, we should try. c6166_20090810.patch: tried various xor schemes for 8_8 and 16_16. Seems Crc32_8_8d is the best choice. I will generate a patch to replace current PureJavaCrc32 unless someone would like to run some benchmarks.
Performance Table (The unit is MB/sec)
Performance Table (The unit is MB/sec)
We probably want to have Todd's concurrency test from
We might also want to try the old four at a time code for 4 <= len < 8. We should also confirm the results on one of the other systems we tested in the past. I won't be able to do that for a couple days, but it should be easy then. +public class Crc32_8_8e extends Crc32Base { public void update(byte[] b, int off, int len) { while(len > 7) { int c0 = b[off++] ^ crc; int c1 = b[off++] ^ (crc >>>= 8); int c2 = b[off++] ^ (crc >>>= 8); int c3 = b[off++] ^ (crc >>>= 8); crc = T8_7[c0 & 0xff] ^ T8_6[c1 & 0xff] ^ T8_5[c2 & 0xff] ^ T8_4[c3 & 0xff]; // three xors crc ^= T8_3[b[off++] & 0xff] ^ T8_2[b[off++] & 0xff]; // two xors crc ^= T8_1[b[off++] & 0xff] ^ T8_0[b[off++] & 0xff]; // two xors len -= 8; } while(len > 0) { crc = (crc >>> 8) ^ T8_0[(crc ^ b[off++]) & 0xff]; len--; } } } c6166_20090811.patch: added Crc32_8_8e and deleted some old classes.
> We probably want to have Todd's concurrency test from > We might also want to try the old four at a time code for 4 <= len < 8. > We should also confirm the results on one of the other systems we tested in the past. I won't be able to do that for a couple days, but it should be easy then. > How many other variants did you try? Intel's C code does some strange things to group CRC's by 3. ... Included Crc32_8_8e below. Crc32_8_8d still seems the best choice.
Performance Table (The unit is MB/sec)
Performance Table (The unit is MB/sec)
Yep, we're doing sprint planning tomorrow so I'll make sure to block off a couple hours to run benchmarks for this. Attaching PDF of the concurrency benchmark from
I've normalized all of the results as percentages of the very fastest. It looks like they're all within 5-10% of each other for this benchmark, and with the error bars it's pretty obvious that there is no clear winner for the 1-byte-write case at any concurrency level. If anything Crc32_6_6 seems to come out a little ahead, but really barely discernible. I'd upload the script I used to run the benchmarks, but I had to add a silly little patch to Hadoop to be able to switch between implementations easily, so it's a hassle to run. For reference, this system is a Nehalem box running 64-bit JDK 1.6.0u14, 8 cores w/ hyperthreading (16 logical). Oops, excuse me - off-by-one error in my R script for matching up colors to datasets. Same results, except the one that appears to do the best is 8_8b
Thanks, Todd.
8_8b is also best in my benchmark ran on Windows. Question: Crc32Base and Crc32Table are not crc implementations. What do they mean in the graph? Oops - I had some old data files lying around in my "out" directory from a previous run where I had a mistake in my code. I've removed them and regenerated the graphs here.
I'm currently running CRC32 benchmarks without concurrency and will upload those when they finish. Looks like the benchmark has run long enough to get good data. Here are the benchmarks from TestPureJavaCrc32 run on three different test systems. nehalem32 is the same nehalem box (3MB L2 cache) running a 32-bit JVM. nehalem64 is that box with a 64-bit JVM. "laptop" is my MacBook Pro (Core 2 duo) running a 64-bit JVM.
Each PDF has several pages:
I ran the whole benchmark suite 50+ times to generate the error bars. Hopefully they'll serve as a good visual indicator for where the differences are actually statistically significant. In summary, here's how I interpret the data:
So, I think the next step here is to profile a couple of MR applications to see what sizes are most common. My personal opinion is that we should target the 64-bit Nehalem architecture and the 128-byte size range. This would point to the 8_8d implementation as the winner. > ... This would point to the 8_8d implementation as the winner.
I agree since Crc32_8_8d is the best on 64-bit platforms.
+1, though I just realized I didn't run my 32-bit JVM tests with -server. I should probably run that this afternoon before we make a final decision. Ran the same benchmarks with java -d32 -server and the results are indistinguishable (I guess since the benchmarks run a long time the client VM does just as much JIT as the server VM).
So, pending anyone coming up with evidence that we should fine tune for the really-small-checksum case on processors like my laptop, 8_8d wins by enough of a margin it seems worth doing. Great test results!
The conclusions I draw from these:
The 32 bit JVM likes 8_8b, 64 bit 8_8d. I agree that d is the winner here. Because the '8' variants shift to one byte at a time if the input is less than 8 bytes, they perform worse than the old PureJavaCrc32 at the 4 byte to 7 byte level. Is this important? It would be useful to know how often the crc code is called on small byte chunks. We can get this to near PureJavaCrc32 speeds for 4 byte sizes if we add a four byte at a time block to 8_8d. That is, we can go to something of the form: while(len > 7) { < 8 at a time code here > } while(len > 3) { < 4 at a time code here> } while(len > 0) { < 1 at a time code here> } Whether that is important depends on the use cases and frequency of requests in the 4-7 and 12-15 byte range. We can even have a block for two-at a time optimization. Personally, I'm satisfied at this point
Regarding the importance of the different sizes, I think the next step is to instrument the CRC32 to log every invocation to a file and run some single-node wordcounts or something. With a histogram of the results it should be easy to see whether the small-buffer case matters at all in real scenarios. > Because the '8' variants shift to one byte at a time if the input is less than 8 bytes, they perform worse than the old PureJavaCrc32 at the 4 byte to 7 byte level. Is this important? It would be useful to know how often the crc code is called on small byte chunks. We can get this to near PureJavaCrc32 speeds for 4 byte sizes if we add a four byte at a time block to 8_8d.
I tried this before but it did not help. BTW, the while-loop in the middle should be an if-statement since we have len <= 7. > I tried this before but it did not help. ...
I should be more clear: it did help for the 4-byte case but it hurt the performance in some other cases. c6166_20090819.patch: added Crc32_8_8b2 and Crc32_8_8d2. See whether anyone wants to play with them. Here is my results.
Performance Table (The unit is MB/sec)
Performance Table (The unit is MB/sec)
c6166_20090819review.patch: patch for reviewing.
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12417054/c6166_20090819review.patch against trunk revision 804918. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/615/testReport/ This message is automatically generated. +1
Very impressive work. Thanks Nicholas, Todd, and Scott! Integrated in TestBuilds #6 (See http://hudson.zones.apache.org/hudson/job/TestBuilds/6/
. Further improve the performance of the pure-Java CRC32 implementation. Contributed by Tsz Wo (Nicholas), SZE Integrated in Hadoop-Common-trunk #70 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk/70/
. Further improve the performance of the pure-Java CRC32 implementation. Contributed by Tsz Wo (Nicholas), SZE |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||