Issue Details (XML | Word | Printable)

Key: HADOOP-5318
Type: Bug Bug
Status: Resolved Resolved
Resolution: Won't Fix
Priority: Major Major
Assignee: Todd Lipcon
Reporter: Ben Maurer
Votes: 0
Watchers: 18
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

Poor IO Performance due to AtomicLong operations

Created: 24/Feb/09 04:29 AM   Updated: 09/Sep/09 07:14 AM
Return to search
Component/s: None
Affects Version/s: 0.19.0
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works buf.patch 2009-02-24 07:34 PM Ben Maurer 5 kB
Text File Licensed for inclusion in ASF works buffer-output.patch 2009-02-24 06:09 PM Ben Maurer 0.9 kB
Text File Licensed for inclusion in ASF works hadoop-5318.txt 2009-07-24 07:43 PM Todd Lipcon 2 kB
Text File Licensed for inclusion in ASF works hadoop-5318.txt 2009-04-15 08:30 PM Todd Lipcon 2 kB
Text File Licensed for inclusion in ASF works hadoop-5318.txt 2009-04-14 01:44 AM Todd Lipcon 2 kB
Java Source File Licensed for inclusion in ASF works TestWriteConcurrency.java 2009-07-22 07:30 PM Todd Lipcon 3 kB
Java Source File Licensed for inclusion in ASF works TestWriteConcurrency.java 2009-06-14 03:49 AM Todd Lipcon 2 kB
Java Source File Licensed for inclusion in ASF works TestWriteConcurrency.java 2009-04-14 01:44 AM Todd Lipcon 2 kB
Image Attachments:

1. Rplot001.png
(29 kB)
Environment: 2x quad core xeon linux 64 bit
Issue Links:
Reference

Resolution Date: 09/Sep/09 07:14 AM


 Description  « Hide
The AtomicLong operations in counting file system statistics can cause high levels of contention with multiple threads. This test demonstrates having multiple threads writing to different sequence files:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.ByteWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.Writer;
import org.apache.hadoop.io.SequenceFile.CompressionType;


public class Test {
	public static void main(String[] args) throws IOException {
		final Configuration c = new Configuration();
		final FileSystem fs = FileSystem.get(c);
		
		final int NUM = 1000*1000;
		for (int i = 0; i < Integer.valueOf(args[0]); i ++) {
			final int ii = i;
			new Thread(new Runnable() {
				@Override
				public void run() {
					
					try {
						Writer f = SequenceFile.createWriter(fs, c, new Path("/test/" + ii ), ByteWritable.class, ByteWritable.class, CompressionType.NONE);
						ByteWritable v = new ByteWritable();
						
						long time = System.currentTimeMillis();
						for (int i = 0; i < NUM; i ++)
							f.append(v, v);
						f.close();
						long end = System.currentTimeMillis();
						
						System.out.printf("%d opartions in %d msec. %f/second\n", NUM, end - time, (float)(1000 * NUM)/(end - time));
						
					} catch (Exception e) {
						// TODO Auto-generated catch block
						e.printStackTrace();
					}
					
				}
			}).start();
		}
	}
}

The results of this benchmark are

==== 1 threads ====
1000000 opartions in 1431 msec. 698812.000000/second
==== 2 threads ====
1000000 opartions in 3001 msec. 333222.250000/second
1000000 opartions in 2985 msec. 335008.375000/second
==== 3 threads ====
1000000 opartions in 4923 msec. 203128.171875/second
1000000 opartions in 4924 msec. 203086.921875/second
1000000 opartions in 4981 msec. 200762.906250/second
==== 4 threads ====
1000000 opartions in 6716 msec. 148898.156250/second
1000000 opartions in 7048 msec. 141884.218750/second
1000000 opartions in 7342 msec. 136202.671875/second
1000000 opartions in 7344 msec. 136165.578125/second
==== 5 threads ====
1000000 opartions in 10366 msec. 96469.226563/second
1000000 opartions in 11085 msec. 90212.000000/second
1000000 opartions in 11121 msec. 89919.968750/second
1000000 opartions in 11464 msec. 87229.585938/second
1000000 opartions in 11538 msec. 86670.132813/second
==== 6 threads ====
1000000 opartions in 16513 msec. 60558.347656/second
1000000 opartions in 17704 msec. 56484.410156/second
1000000 opartions in 18219 msec. 54887.753906/second
1000000 opartions in 18550 msec. 53908.355469/second
1000000 opartions in 18605 msec. 53748.992188/second
1000000 opartions in 18663 msec. 53581.953125/second
==== 7 threads ====
1000000 opartions in 22207 msec. 45030.847656/second
1000000 opartions in 23275 msec. 42964.554688/second
1000000 opartions in 23484 msec. 42582.183594/second
1000000 opartions in 24378 msec. 41020.593750/second
1000000 opartions in 24425 msec. 40941.656250/second
1000000 opartions in 24533 msec. 40761.421875/second
1000000 opartions in 24645 msec. 40576.183594/second
==== 8 threads ====
1000000 opartions in 26375 msec. 37914.691406/second
1000000 opartions in 26420 msec. 37850.113281/second
1000000 opartions in 26532 msec. 37690.335938/second
1000000 opartions in 26670 msec. 37495.312500/second
1000000 opartions in 29772 msec. 33588.605469/second
1000000 opartions in 29859 msec. 33490.738281/second
1000000 opartions in 30098 msec. 33224.800781/second
1000000 opartions in 30082 msec. 33242.468750/second

However, if I comment out the file system statistics increments, the benchmark improves to:

==== 1 threads ====
1000000 opartions in 1194 msec. 837520.937500/second
==== 2 threads ====
1000000 opartions in 1433 msec. 697836.687500/second
1000000 opartions in 1433 msec. 697836.687500/second
==== 3 threads ====
1000000 opartions in 1643 msec. 608642.750000/second
1000000 opartions in 1643 msec. 608642.750000/second
1000000 opartions in 1639 msec. 610128.125000/second
==== 4 threads ====
1000000 opartions in 1886 msec. 530222.687500/second
1000000 opartions in 1886 msec. 530222.687500/second
1000000 opartions in 1886 msec. 530222.687500/second
1000000 opartions in 1899 msec. 526592.937500/second
==== 5 threads ====
1000000 opartions in 2065 msec. 484261.500000/second
1000000 opartions in 2066 msec. 484027.093750/second
1000000 opartions in 2067 msec. 483792.937500/second
1000000 opartions in 2066 msec. 484027.093750/second
1000000 opartions in 2066 msec. 484027.093750/second
==== 6 threads ====
1000000 opartions in 2151 msec. 464900.031250/second
1000000 opartions in 2111 msec. 473709.156250/second
1000000 opartions in 2153 msec. 464468.187500/second
1000000 opartions in 2114 msec. 473036.906250/second
1000000 opartions in 2113 msec. 473260.781250/second
1000000 opartions in 2112 msec. 473484.843750/second
==== 7 threads ====
1000000 opartions in 2368 msec. 422297.312500/second
1000000 opartions in 2334 msec. 428449.000000/second
1000000 opartions in 2332 msec. 428816.468750/second
1000000 opartions in 2330 msec. 429184.562500/second
1000000 opartions in 2332 msec. 428816.468750/second
1000000 opartions in 2375 msec. 421052.625000/second
1000000 opartions in 2394 msec. 417710.937500/second
==== 8 threads ====
1000000 opartions in 2517 msec. 397298.375000/second
1000000 opartions in 2538 msec. 394011.031250/second
1000000 opartions in 2538 msec. 394011.031250/second
1000000 opartions in 2538 msec. 394011.031250/second
1000000 opartions in 2539 msec. 393855.843750/second
1000000 opartions in 2614 msec. 382555.468750/second
1000000 opartions in 2666 msec. 375093.781250/second
1000000 opartions in 2701 msec. 370233.250000/second


 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Steve Loughran added a comment - 24/Feb/09 02:13 PM
is this 64-bit java on 64-bit OS? If so, surprising -atomic long operations should be atomic at the x86 opcode level, with minimal contention

Ben Maurer added a comment - 24/Feb/09 05:28 PM
Yes it is. Looking at the JDK source, they do CAS rather than an xadd instruction:

public final long getAndAdd(long delta) {
while (true) { long current = get(); long next = current + delta; if (compareAndSet(current, next)) return current; }
}

So there can easily be contention if this is executed frequently.

I think the best path here may be to make sure that buffering is pulled up a few layers of abstraction.


Bryan Duxbury added a comment - 24/Feb/09 05:32 PM
I've noticed the FSStats stuff taking large amounts of CPU time in profiles of our mappers and reducers. I'm not sure why it sucks up so much cpu, but I'd definitely like to see a way to negate this effect.

Ben Maurer added a comment - 24/Feb/09 06:09 PM
A quick hack to make the output path buffered – it'd be nice to see if this helps some real world applications. The input side of this is a bit trickier, still working on it.

Ben Maurer added a comment - 24/Feb/09 07:34 PM
Updated version of the patch. Handles the read and write path. Haven't benchmarked reads yet, but on writes we get the following improvements (1 byte key/values):
1 threads 4 threads 8 threads
21% 449% 1041%

Doug Cutting added a comment - 25/Feb/09 08:08 PM
Inserting a 256k output buffer in every FSDataOutputStream is probably not good. Each FileSystem implementation internally buffers already. Adding a big new buffer in front increases the memory used and also means that data is in memory longer before it is checksummed. I also suspect even a 100b buffer would be enough, since, at the top-level, much i/o is byte-by-byte. But adding a new small buffer would increase the times data is copied, which we should also avoid.

So I'd suggest that, rather than adding a buffer, PositionCache can just be lazy about reporting statistics. We can add code like:

private static final int REPORT_INTERVAL = 1024;
private int unreported;

private void incrementBytesWritten(int bytesWritten) {
  unreported += bytesWritten;
  if (unreported > REPORT_INTERVAL)
    reportBytesWritten();
}

private reportBytesWritten() {
   statistics.incrementBytesWritten(unreported);
   unreported = 0;
}

Then call incrementBytesWritten() in the write() methods and add a call to reportBytesWritten() to close().

Does that make sense?


Ben Maurer added a comment - 25/Feb/09 09:00 PM
When I benchmarked, I saw a performance gain from not going into the CRC routines, etc for each byte reported. Also, I did see some level of gains for using a larger IO buffer (though I haven't tested that fully).

Todd Lipcon added a comment - 10/Apr/09 09:19 PM

Then call incrementBytesWritten() in the write() methods and add a call to reportBytesWritten() to close().

Does that make sense?

+1 - I like that approach rather than adding a buffer. If the buffer increases performance for CRC, etc, that should be a separate JIRA.


dhruba borthakur added a comment - 10/Apr/09 10:14 PM
> So I'd suggest that, rather than adding a buffer, PositionCache can just be lazy about reporting statistics.

+1. I agree.


Todd Lipcon added a comment - 14/Apr/09 01:43 AM
Implemented the lazy reporting of bytes as Doug suggested.

I didn't see any particular speedup, but I'm working on my dual-core laptop at the moment If someone could give this a try on a real box, might be worth it. Also will attach a test program


Hadoop QA added a comment - 15/Apr/09 05:25 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12405373/TestWriteConcurrency.java
against trunk revision 765025.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

-1 patch. The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/196/console

This message is automatically generated.


Todd Lipcon added a comment - 15/Apr/09 05:41 AM
Looks like the Hadoop QA bot attempted to apply the .java file as a patch... not sure how to convince it to apply the patch file.

dhruba borthakur added a comment - 15/Apr/09 01:39 PM
The hadoopqa patch proceses always picks the last attached file. In this case, this was the java file. Please re-attach the latest and greatest patch to this JIRA and then cancel and submit patch once again. Thanks.

Todd Lipcon added a comment - 15/Apr/09 08:30 PM
reattaching patch for QA

Todd Lipcon added a comment - 15/Apr/09 08:33 PM
Dhruba: I don't appear to have permissions to twiddle the "Patch Available" state on this issue. Hopefully someone else can retrigger QA

dhruba borthakur added a comment - 15/Apr/09 08:45 PM
Todd: I added you as "contributors". Please see if you are able to Cancel and then Submit the patch

Todd Lipcon added a comment - 15/Apr/09 08:48 PM
Yep, that worked. Thanks Dhruba

Hadoop QA added a comment - 07/May/09 05:12 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12405569/hadoop-5318.txt
against trunk revision 772482.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 Eclipse classpath. The patch retains Eclipse classpath integrity.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/295/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/295/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/295/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/295/console

This message is automatically generated.


Todd Lipcon added a comment - 21/May/09 06:18 PM
The test failure reported by Hudson is on the capacity scheduler contrib (unrelated)

Given that this is a performance-related patch, I'd like to hear back from Ben that the patch to be committed shows similar performance gains to the original patch.


Ben Maurer added a comment - 24/May/09 03:39 AM
When I tested stuff out, I got a boost from doing buffering that can't be replicated with just the grouping of atomic increments – however, if we're just going to go with the simple version, this patch is as good as it gets.

A similar patch is needed for input streams.


Todd Lipcon added a comment - 01/Jun/09 04:13 PM
Ben: sorry, I wasn't clear from your last comment - did you try out the newest patch from this issue, or just commenting on what you saw on the original patch?

If you don't have a chance to try it, I can fire it up on an 8 core box somewhere and see what I get..


Todd Lipcon added a comment - 14/Jun/09 03:47 AM
I just tried this patch on an EC2 c1.xlarge instance (8 cores) and couldn't reproduce the expected performance improvements. I updated the test program a bit to run 10 trials and run System.gc in between each (since the first couple trials seem to speed up due to JITting). I was writing into /dev/shm so actual IO performance shouldn't be a factor - just the contention on the statistics lock. I also changed the test program output format to be suitable for loading into R for analysis. Here's the t-test which fails to show a significant improvement (taking 20:80 to chop off the first 2 runs where JIT happened):
> d.0k <- read.table(file="0k.tsv",header=T)
> d.1k <- read.table(file="1k.tsv",header=T)
> t.test((d.1k$rate - d.0k$rate)[20:80])

        One Sample t-test

data:  (d.1k$rate - d.0k$rate)[20:80] 
t = -0.6754, df = 60, p-value = 0.502
alternative hypothesis: true mean is not equal to 0 
95 percent confidence interval:
 -68205.16  33772.57 
sample estimates:
mean of x 
-17216.30 

The p-value = 0.502 is pretty unconvincing.

Any thoughts from the various parties who are seeing this contention in practice?


Todd Lipcon added a comment - 14/Jun/09 03:49 AM
Here's the updated test code that does more trials, gcs, etc

Todd Lipcon added a comment - 14/Jun/09 04:00 AM
Just tried two more situations with the test code. The first was to comment out all of the increments in PositionCache, and again saw no improvement. I also tried Ben's patch "buf.patch" as well with the same test code. The performance increase is quite pronounced:
> d.benm <- read.table(file="benm-patch.tsv", header=T)
> t.test(d.benm$rate - d.0k$rate)

        One Sample t-test

data:  d.benm$rate - d.0k$rate 
t = 32.835, df = 79, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0 
95 percent confidence interval:
 2925074 3302593 
sample estimates:
mean of x 
  3113833 

So, this JIRA should focus on figuring out how we can get that same benefit while addressing Doug's concerns above.


Todd Lipcon added a comment - 14/Jun/09 08:28 AM
It seems like the real culprit here is HADOOP-5598. Switching to the pure java CRC32 makes the benchmark scale nearly linearly to 8 threads.

Todd Lipcon added a comment - 22/Jul/09 07:30 PM
Attaching a benchmark and the resulting box plots. This benchmark writes 1M pairs of <ByteWritable, ByteWritable> into a SequenceFile with varying number of threads. The three series are:

green: Before HADOOP-6148's improvement of CRC32
red: With HADOOP-6148's CRC committed (as it is in trunk)
blue: With HADOOP-6148's CRC plus hadoop-5318.txt from this JIRA

The results clearly show that the CRC implementation made the majority of the difference. Adding this patch seems to result in a very slight improvement. Here's the R script I used to generate the plot:

d.before.crc32 <- read.table(file="results_before_crc32.tsv", header=F)
d.before <- read.table(file="results_before.tsv", header=F)
d.after <- read.table(file="results_after.tsv", header=F)

lims <- c(1000000, 6400000)
log <- "" /* change to "y" to get log scale */

boxplot(V2 ~ V1, d.before.crc32, col="green", ylim=lims, at=(1:8)-0.4, log=log, boxwex=0.2)
boxplot(V2 ~ V1, d.before, col="red", ylim=lims, at=(1:8)-0.2,boxwex=0.2, log=log, add=T)
boxplot(V2 ~ V1, d.after, col="blue", ylim=lims, at=1:8+0.2,boxwex=0.2, log=log,  add=T)
legend(legend=c("before CRC32", "crc, no 5318", "5318+crc"), fill=c("green", "red", "blue"), x="topright",col="black")

Ran the benchmarks on:

Java(TM) SE Runtime Environment (build 1.6.0_14-b08)
Java HotSpot(TM) 64-Bit Server VM (build 14.0-b16, mixed mode)

Machine is a Nehalem box, dual quad core with hyperthreading (/proc/cpuinfo shows 16 CPUs: Intel(R) Xeon(R) CPU X5570 @ 2.93GHz)


Scott Carey added a comment - 22/Jul/09 08:26 PM
Since the CRC code reaches to near peak throughput (90%) at about 128 or 256 byte chunks, a small buffer may be beneficial. Copying extra data is cheap for this size (2 to 4 cache lines – the copy is in L1 cache), and extra delay before the CRC is also not very risky since its in L1 cache very briefly, not sitting off CPU in RAM. On the other hand, if all clients are buffering by at least this size, it is irrelevant. Writing tiny chunks however will impact throughput as the CRC per-call overhead adds up.

See HADOOP-6148 for CRC32 benchmark numbers. Key breakpoints are about 128 to 256 byte chunk size for ~90% peak throughput, and 2k to 4k chunks where peak throughput is reached.


Todd Lipcon added a comment - 22/Jul/09 09:07 PM
Scott: I think if we add buffering it should be at the "client" level - eg in SequenceFileOutputStream or anywhere else that very small records will be written at a high rate. Doing so in the FSDataOutputStream seems heavy-handed, since we don't know what size the writes will usually be, and I'd imagine the usual is much larger than this pathological test case we used for the benchmark.

I'm also not sure I see your point about the L1 cache - in between the small writes there may be arbitrary amounts of computation (eg in the mapper function) that would evict the buffer from cache.


Scott Carey added a comment - 22/Jul/09 09:37 PM
This performance issue is largely (but not completely) fixed by HADOOP-6148, so I'm linking it.

Tsz Wo (Nicholas), SZE added a comment - 23/Jul/09 06:31 PM
HADOOP-6166 may help this.

Todd Lipcon added a comment - 24/Jul/09 07:43 PM
Given that we got the majority of speedup out of the CRC changes, I think we should resolve this ticket as wontfix. I just ran the same benchmark of trunk vs patched on my dual core laptop and the performance was essentially the same. On the Nehalem box it looked like there might have been a few percent speedup, but it seems not-worth-it for the added complexity.

If anyone disagrees (or has a benchmark to back up this patch), please speak up. I'll upload an updated patch against trunk so people can give it a try.


Todd Lipcon added a comment - 24/Jul/09 07:43 PM
Patch against trunk common

Todd Lipcon added a comment - 09/Sep/09 07:14 AM
Resolving as "won't fix" since the problem seems to have been solved by the recent CRC32 improvements.