|
In
Oops. It looks like you already sort of answered that above... I should read more carefully!
Initially I thought BUFFER_SIZE (io.file.buffer.size) was 4k in the nightly benchmark tests. When we set bytes.per.checksum to 4k, there was a small improvement, but not good enough.. need to do more testing. I think even with 8k, results will be similar.
Proposed patch that uses 'BUFFER_SIZE' for buffering on Datanode. Since I was not able to reproduce the full extent of the regression seen on nightly benchmarks, will wait till this patch goes through on such run.
The buffering fixes the most of the performance difference observed in nightly tests. Still sort runs are taking a little longer. Looking at aggregate read and write rates during the sort benchmarks (using Simon), most of the difference is attributable to 'long tail' of maps/reduces. It is not clear if Block CRCs or this patch causes a few extra map or reduce failures. Currently looking at various logs. Could you please apply the patch for
TestDFSIO is a simpler test. After analyzing files written during DFSIO-write test, it looks like just handful of slow nodes (disk or network) slowdown the over all job. From namenode logs, time take to write a 320 MB file on 500 nodes varies from 26 sec to 380 sec (on one of the runs with avg of 75 sec). I will look at time taken to write these files during sort. For writes, Hadoop can work around slow nodes problem by avoiding nodes that have many pending writes inside chooseTarget. Since we don't keep track of reads, adaptively avoiding slow nodes is harder. But this problem is more severe for writes. Also once we write less to a node, we will end up reading less as well. Digressing from this jira little bit.. Namenode does not need to track this information. Datanode can report 'active write/reads' in its heartbeat and namenode can give preference to the datanodes that have less active transactions in chooseTarget().
The load is already considered in DFS. Latest patch that adds buffering to various consumers and producers of block data. With this patch most of of the performance gap in benchmarks is closed. With TestDFSIO we are still seeing 3-5% difference on average. Each time this difference can be traced to nodes with slow disks. Whether block crcs makes bad nodes worse is not clear.
This patch adds buffer while writing data to disk as well as while reading from disk. From the tests, buffer while writing is more important. I guess OS read-ahead while reading the data makes buffer for reading. Of course, extra buffering add extra data copies. I will file another jira to remove majority of these copies without changing buffering. Another change is that DataNode opens block file with RandomAccessFile() and seeks to first read position. It used to skip() to the position.
Thanks for the feedback, Konstantin.
(1) - done (3) : Could you elaborate? This Jira is about improved performance with buffering. Buffer for reading the file is less important than buffer for writing the file (on machines tested). But it still helps. +1
On 3: I mean that the performance gains are in single digits percentage-wise, so it is important to minimize memory costs, that is I am agreeing you should file a new issue to deal with redundant data. +1
http://issues.apache.org/jira/secure/attachment/12363381/HADOOP-1649.patch Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/534/testReport/ I just committed this. Thanks Raghu!
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Only once, trunk took 3600 secs, not sure why.
Either of the following fixed this gap :
Regd matching the buffer size :
I will attach a patch that does this. But one negative thing about this is that it has 3 extra copies compared to pre-1134. But this does not affect the benchmarks since usually our benchmarks are not cpu bound on datanodes. The following occur because datanode processes one checksum chunk at a time :
It personally feels pretty bad to be responsible for so many copies even if benchmarks are not affected. (May be in Hadoop-0.15), we can avoid first two copies by changing the write loop a little to do larger buffer reads instead of on checksum chunk. The third can be avoided if checksum is sent on different socket.
Another regression we have seen is with TestDFSIO. This one is not affected noticeably on small clusters (<100) but aggregate rate came down by about 30% on 500 node cluster. This is a shorter test in terms of time and I will look into this as well.