|
[
Permlink
| « Hide
]
Steve Loughran added a comment - 24/Feb/09 02:13 PM
is this 64-bit java on 64-bit OS? If so, surprising -atomic long operations should be atomic at the x86 opcode level, with minimal contention
Yes it is. Looking at the JDK source, they do CAS rather than an xadd instruction:
public final long getAndAdd(long delta) { So there can easily be contention if this is executed frequently. I think the best path here may be to make sure that buffering is pulled up a few layers of abstraction. I've noticed the FSStats stuff taking large amounts of CPU time in profiles of our mappers and reducers. I'm not sure why it sucks up so much cpu, but I'd definitely like to see a way to negate this effect.
A quick hack to make the output path buffered – it'd be nice to see if this helps some real world applications. The input side of this is a bit trickier, still working on it.
Ben Maurer made changes - 24/Feb/09 06:09 PM
Updated version of the patch. Handles the read and write path. Haven't benchmarked reads yet, but on writes we get the following improvements (1 byte key/values):
Ben Maurer made changes - 24/Feb/09 07:34 PM
Inserting a 256k output buffer in every FSDataOutputStream is probably not good. Each FileSystem implementation internally buffers already. Adding a big new buffer in front increases the memory used and also means that data is in memory longer before it is checksummed. I also suspect even a 100b buffer would be enough, since, at the top-level, much i/o is byte-by-byte. But adding a new small buffer would increase the times data is copied, which we should also avoid.
So I'd suggest that, rather than adding a buffer, PositionCache can just be lazy about reporting statistics. We can add code like: private static final int REPORT_INTERVAL = 1024; private int unreported; private void incrementBytesWritten(int bytesWritten) { unreported += bytesWritten; if (unreported > REPORT_INTERVAL) reportBytesWritten(); } private reportBytesWritten() { statistics.incrementBytesWritten(unreported); unreported = 0; } Then call incrementBytesWritten() in the write() methods and add a call to reportBytesWritten() to close(). Does that make sense? When I benchmarked, I saw a performance gain from not going into the CRC routines, etc for each byte reported. Also, I did see some level of gains for using a larger IO buffer (though I haven't tested that fully).
+1 - I like that approach rather than adding a buffer. If the buffer increases performance for CRC, etc, that should be a separate JIRA. > So I'd suggest that, rather than adding a buffer, PositionCache can just be lazy about reporting statistics.
+1. I agree. Implemented the lazy reporting of bytes as Doug suggested.
I didn't see any particular speedup, but I'm working on my dual-core laptop at the moment
Todd Lipcon made changes - 14/Apr/09 01:43 AM
Todd Lipcon made changes - 14/Apr/09 01:44 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12405373/TestWriteConcurrency.java against trunk revision 765025. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. -1 patch. The patch command could not apply the patch. Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/196/console This message is automatically generated. Looks like the Hadoop QA bot attempted to apply the .java file as a patch... not sure how to convince it to apply the patch file.
The hadoopqa patch proceses always picks the last attached file. In this case, this was the java file. Please re-attach the latest and greatest patch to this JIRA and then cancel and submit patch once again. Thanks.
Todd Lipcon made changes - 15/Apr/09 08:30 PM
Dhruba: I don't appear to have permissions to twiddle the "Patch Available" state on this issue. Hopefully someone else can retrigger QA
Todd: I added you as "contributors". Please see if you are able to Cancel and then Submit the patch
Todd Lipcon made changes - 15/Apr/09 08:48 PM
Yep, that worked. Thanks Dhruba
Todd Lipcon made changes - 15/Apr/09 08:48 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12405569/hadoop-5318.txt against trunk revision 772482. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/295/testReport/ This message is automatically generated. The test failure reported by Hudson is on the capacity scheduler contrib (unrelated)
Given that this is a performance-related patch, I'd like to hear back from Ben that the patch to be committed shows similar performance gains to the original patch. When I tested stuff out, I got a boost from doing buffering that can't be replicated with just the grouping of atomic increments – however, if we're just going to go with the simple version, this patch is as good as it gets.
A similar patch is needed for input streams. Ben: sorry, I wasn't clear from your last comment - did you try out the newest patch from this issue, or just commenting on what you saw on the original patch?
If you don't have a chance to try it, I can fire it up on an 8 core box somewhere and see what I get..
Todd Lipcon made changes - 01/Jun/09 04:16 PM
I just tried this patch on an EC2 c1.xlarge instance (8 cores) and couldn't reproduce the expected performance improvements. I updated the test program a bit to run 10 trials and run System.gc in between each (since the first couple trials seem to speed up due to JITting). I was writing into /dev/shm so actual IO performance shouldn't be a factor - just the contention on the statistics lock. I also changed the test program output format to be suitable for loading into R for analysis. Here's the t-test which fails to show a significant improvement (taking 20:80 to chop off the first 2 runs where JIT happened):
> d.0k <- read.table(file="0k.tsv",header=T)
> d.1k <- read.table(file="1k.tsv",header=T)
> t.test((d.1k$rate - d.0k$rate)[20:80])
One Sample t-test
data: (d.1k$rate - d.0k$rate)[20:80]
t = -0.6754, df = 60, p-value = 0.502
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-68205.16 33772.57
sample estimates:
mean of x
-17216.30
The p-value = 0.502 is pretty unconvincing. Any thoughts from the various parties who are seeing this contention in practice? Here's the updated test code that does more trials, gcs, etc
Todd Lipcon made changes - 14/Jun/09 03:49 AM
Just tried two more situations with the test code. The first was to comment out all of the increments in PositionCache, and again saw no improvement. I also tried Ben's patch "buf.patch" as well with the same test code. The performance increase is quite pronounced:
> d.benm <- read.table(file="benm-patch.tsv", header=T)
> t.test(d.benm$rate - d.0k$rate)
One Sample t-test
data: d.benm$rate - d.0k$rate
t = 32.835, df = 79, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
2925074 3302593
sample estimates:
mean of x
3113833
So, this JIRA should focus on figuring out how we can get that same benefit while addressing Doug's concerns above. It seems like the real culprit here is HADOOP-5598. Switching to the pure java CRC32 makes the benchmark scale nearly linearly to 8 threads.
Attaching a benchmark and the resulting box plots. This benchmark writes 1M pairs of <ByteWritable, ByteWritable> into a SequenceFile with varying number of threads. The three series are:
green: Before The results clearly show that the CRC implementation made the majority of the difference. Adding this patch seems to result in a very slight improvement. Here's the R script I used to generate the plot: d.before.crc32 <- read.table(file="results_before_crc32.tsv", header=F) d.before <- read.table(file="results_before.tsv", header=F) d.after <- read.table(file="results_after.tsv", header=F) lims <- c(1000000, 6400000) log <- "" /* change to "y" to get log scale */ boxplot(V2 ~ V1, d.before.crc32, col="green", ylim=lims, at=(1:8)-0.4, log=log, boxwex=0.2) boxplot(V2 ~ V1, d.before, col="red", ylim=lims, at=(1:8)-0.2,boxwex=0.2, log=log, add=T) boxplot(V2 ~ V1, d.after, col="blue", ylim=lims, at=1:8+0.2,boxwex=0.2, log=log, add=T) legend(legend=c("before CRC32", "crc, no 5318", "5318+crc"), fill=c("green", "red", "blue"), x="topright",col="black") Ran the benchmarks on: Java(TM) SE Runtime Environment (build 1.6.0_14-b08) Machine is a Nehalem box, dual quad core with hyperthreading (/proc/cpuinfo shows 16 CPUs: Intel(R) Xeon(R) CPU X5570 @ 2.93GHz)
Todd Lipcon made changes - 22/Jul/09 07:30 PM
Since the CRC code reaches to near peak throughput (90%) at about 128 or 256 byte chunks, a small buffer may be beneficial. Copying extra data is cheap for this size (2 to 4 cache lines – the copy is in L1 cache), and extra delay before the CRC is also not very risky since its in L1 cache very briefly, not sitting off CPU in RAM. On the other hand, if all clients are buffering by at least this size, it is irrelevant. Writing tiny chunks however will impact throughput as the CRC per-call overhead adds up.
See Scott: I think if we add buffering it should be at the "client" level - eg in SequenceFileOutputStream or anywhere else that very small records will be written at a high rate. Doing so in the FSDataOutputStream seems heavy-handed, since we don't know what size the writes will usually be, and I'd imagine the usual is much larger than this pathological test case we used for the benchmark.
I'm also not sure I see your point about the L1 cache - in between the small writes there may be arbitrary amounts of computation (eg in the mapper function) that would evict the buffer from cache.
Scott Carey made changes - 22/Jul/09 09:37 PM
This performance issue is largely (but not completely) fixed by
Tsz Wo (Nicholas), SZE made changes - 23/Jul/09 06:31 PM
Given that we got the majority of speedup out of the CRC changes, I think we should resolve this ticket as wontfix. I just ran the same benchmark of trunk vs patched on my dual core laptop and the performance was essentially the same. On the Nehalem box it looked like there might have been a few percent speedup, but it seems not-worth-it for the added complexity.
If anyone disagrees (or has a benchmark to back up this patch), please speak up. I'll upload an updated patch against trunk so people can give it a try.
Todd Lipcon made changes - 24/Jul/09 07:43 PM
Resolving as "won't fix" since the problem seems to have been solved by the recent CRC32 improvements.
Todd Lipcon made changes - 09/Sep/09 07:14 AM
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||