Key: HADOOP-1649
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Blocker
Assignee: Raghu Angadi
Reporter: Raghu Angadi
Votes: 0
Watchers: 0
Hadoop Common

Performance regression with Block CRCs

Created: 23/Jul/07 11:34 PM   Updated: 08/Jul/09 04:41 PM
Component/s: None
Affects Version/s: 0.14.0
Fix Version/s: 0.14.0

Time Tracking:
Not Specified

File Attachments:
  HADOOP-1649.patch (text file, 8 kB, licensed for inclusion in ASF works) 2007-08-08 04:10 AM Raghu Angadi
  HADOOP-1649.patch (text file, 7 kB, licensed for inclusion in ASF works) 2007-08-07 04:32 PM Raghu Angadi
  HADOOP-1649.patch (text file, 3 kB, licensed for inclusion in ASF works) 2007-07-25 04:34 AM Raghu Angadi

Resolution Date: 10/Aug/07 06:20 PM


 Description
Performance is noticeably affected by the Block Level CRCs patch (HADOOP-1134). This is more noticeable on writes (randomwriter test etc).

With randomwriter, the job takes 20-25% longer on a small cluster (20 nodes) and maybe 10% longer on a larger cluster.

There are a few differences in how data is written with 1134. As soon as I can reproduce this, I think it will be easier to fix.



Raghu Angadi added a comment - 24/Jul/07 10:57 PM
With randomwriter (with 10 maps per client) on 13 nodes, I am able to consistently reproduce a 3% performance regression compared to pre-1134 trunk. io.file.buffer.size is 64K and io.bytes.per.checksum is 512.
  • pre-1134 : 3190 sec
  • cur trunk : 3290 sec.

Only once, trunk took 3600 secs, not sure why.

Either of the following fixed this gap :

  1. matching the buffer size for the in/out sockets and for the block file while writing a block to the buffering used pre-1134
  2. setting io.bytes.per.checksum to 64K

Regarding matching the buffer size:

I will attach a patch that does this. One negative thing about it is that it has 3 extra copies compared to pre-1134. This does not affect the benchmarks, since our benchmarks are usually not CPU bound on datanodes. The following copies occur because the datanode processes one checksum chunk at a time:

  • from in-socket buffer to small buffer (of size io.bytes.per.checksum)
  • small buffer to mirror-socket buffer,
  • small buffer to block file buffer.
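The three copies listed above can be pictured with a minimal Java sketch. This is not the DataNode code from the patch: the class name ChunkedCopySketch, the method copyByChunk and its parameters are all made up for illustration, and plain java.io streams stand in for the sockets and the block file. It only assumes that data is read one checksum chunk at a time and that each chunk is checksummed with CRC32.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical illustration only; not taken from HADOOP-1649.patch.
class ChunkedCopySketch {

    /** Copies the input one checksum chunk at a time and returns the CRC32 of each chunk. */
    static List<Long> copyByChunk(InputStream socketIn, OutputStream mirrorOut,
                                  OutputStream blockFileOut,
                                  int bytesPerChecksum, int bufferSize) {
        try {
            // All three streams are buffered with the same (large) buffer size.
            InputStream in = new BufferedInputStream(socketIn, bufferSize);
            OutputStream mirror = new BufferedOutputStream(mirrorOut, bufferSize);
            OutputStream file = new BufferedOutputStream(blockFileOut, bufferSize);

            byte[] chunk = new byte[bytesPerChecksum]; // the "small buffer"
            List<Long> checksums = new ArrayList<>();
            CRC32 crc = new CRC32();
            int n;
            // Copy 1: in-socket buffer -> small buffer.
            while ((n = in.readNBytes(chunk, 0, chunk.length)) > 0) {
                crc.reset();
                crc.update(chunk, 0, n);
                checksums.add(crc.getValue());
                mirror.write(chunk, 0, n); // Copy 2: small buffer -> mirror-socket buffer.
                file.write(chunk, 0, n);   // Copy 3: small buffer -> block file buffer.
            }
            mirror.flush();
            file.flush();
            return checksums;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[1300];
        ByteArrayOutputStream mirror = new ByteArrayOutputStream();
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        List<Long> sums = copyByChunk(new ByteArrayInputStream(data), mirror, file, 512, 64 * 1024);
        System.out.println(sums.size() + " chunks, " + file.size() + " bytes written to block file");
    }
}
```

With a larger io.bytes.per.checksum the loop body runs fewer times for the same data, which is one way to read why option 2 above also closed the gap.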

It personally feels pretty bad to be responsible for so many copies, even if benchmarks are not affected. Maybe in Hadoop-0.15 we can avoid the first two copies by changing the write loop a little to do larger buffer reads instead of reading one checksum chunk at a time. The third can be avoided if the checksum is sent on a different socket.

Another regression we have seen is with TestDFSIO. This one is not affected noticeably on small clusters (<100 nodes), but the aggregate rate came down by about 30% on a 500-node cluster. This is a shorter test in terms of time and I will look into this as well.


Doug Cutting added a comment - 25/Jul/07 12:10 AM
In HADOOP-1134 we talked of subsequently testing with larger values for bytes.per.checksum. It would be interesting to see how changing this from 512 to, e.g., 8k affects performance both pre- and post-1134. Have you looked at that?

Doug Cutting added a comment - 25/Jul/07 12:15 AM
Oops. It looks like you already sort of answered that above... I should read more carefully!

Raghu Angadi added a comment - 25/Jul/07 03:37 AM
Initially I thought BUFFER_SIZE (io.file.buffer.size) was 4k in the nightly benchmark tests. When we set bytes.per.checksum to 4k, there was a small improvement, but not good enough. I need to do more testing. I think even with 8k, results will be similar.

Raghu Angadi added a comment - 25/Jul/07 04:34 AM
Proposed patch that uses 'BUFFER_SIZE' for buffering on the Datanode. Since I was not able to reproduce the full extent of the regression seen on nightly benchmarks, I will wait till this patch goes through such a run.

Raghu Angadi added a comment - 31/Jul/07 06:15 AM

The buffering fixes most of the performance difference observed in nightly tests. Sort runs are still taking a little longer. Looking at aggregate read and write rates during the sort benchmarks (using Simon), most of the difference is attributable to a 'long tail' of maps/reduces. It is not clear whether Block CRCs or this patch causes a few extra map or reduce failures. Currently looking at various logs.


Devaraj Das added a comment - 31/Jul/07 12:28 PM
Could you please apply the patch for HADOOP-1651 and see if it helps with the 'long tail' problem?

Raghu Angadi added a comment - 31/Jul/07 11:00 PM

TestDFSIO is a simpler test. After analyzing files written during the DFSIO-write test, it looks like just a handful of slow nodes (disk or network) slow down the overall job. From namenode logs, the time taken to write a 320 MB file on 500 nodes varies from 26 sec to 380 sec (on one of the runs, with an average of 75 sec). I will look at the time taken to write these files during sort.

For writes, Hadoop can work around the slow-nodes problem by avoiding nodes that have many pending writes inside chooseTarget. Since we don't keep track of reads, adaptively avoiding slow nodes is harder for reads. But this problem is more severe for writes. Also, once we write less to a node, we will end up reading less from it as well.


Raghu Angadi added a comment - 31/Jul/07 11:31 PM

Digressing from this jira a little bit: the Namenode does not need to track this information itself. The Datanode can report 'active writes/reads' in its heartbeat, and the namenode can give preference to the datanodes that have fewer active transactions in chooseTarget().
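As a rough illustration of that idea, the sketch below keeps the last reported count of active transfers per datanode and prefers the least busy ones. Everything here is hypothetical: LoadAwareTargetSketch, onHeartbeat and chooseTargets are invented names, not the namenode's API, and the sketch ignores every other factor that real block placement takes into account.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the namenode's chooseTarget().
class LoadAwareTargetSketch {

    // Last 'active writes/reads' count reported by each datanode.
    private final Map<String, Integer> activeTransfers = new HashMap<>();

    /** Records the load a datanode reported in its most recent heartbeat. */
    void onHeartbeat(String datanode, int activeReadsAndWrites) {
        activeTransfers.put(datanode, activeReadsAndWrites);
    }

    /** Returns up to 'replicas' datanodes, least loaded first (ties broken by name). */
    List<String> chooseTargets(int replicas) {
        List<String> nodes = new ArrayList<>(activeTransfers.keySet());
        nodes.sort((a, b) -> {
            int byLoad = Integer.compare(activeTransfers.get(a), activeTransfers.get(b));
            return byLoad != 0 ? byLoad : a.compareTo(b);
        });
        return new ArrayList<>(nodes.subList(0, Math.min(replicas, nodes.size())));
    }

    public static void main(String[] args) {
        LoadAwareTargetSketch namenode = new LoadAwareTargetSketch();
        namenode.onHeartbeat("dn1", 5);
        namenode.onHeartbeat("dn2", 0);
        namenode.onHeartbeat("dn3", 2);
        System.out.println(namenode.chooseTargets(2));
    }
}
```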


Raghu Angadi added a comment - 01/Aug/07 10:08 PM

Digressing from this jira a little bit: the Namenode does not need to track this information itself. The Datanode can report 'active writes/reads' in its heartbeat, and the namenode can give preference to the datanodes that have fewer active transactions in chooseTarget().

The load is already considered in DFS.


Raghu Angadi added a comment - 07/Aug/07 04:32 PM
Latest patch that adds buffering to various consumers and producers of block data. With this patch most of the performance gap in benchmarks is closed. With TestDFSIO we are still seeing a 3-5% difference on average. Each time, this difference can be traced to nodes with slow disks. Whether block CRCs make bad nodes worse is not clear.

This patch adds a buffer while writing data to disk as well as while reading from disk. From the tests, the buffer while writing is more important. I guess OS read-ahead while reading the data makes the buffer for reading less important.
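To make the effect of write buffering concrete, here is a small sketch that counts how many write calls reach the underlying sink when data arrives in checksum-sized chunks, with and without a BufferedOutputStream in front of it. The names WriteBufferingSketch, CountingSink and countWrites are invented for this example, and the sink only stands in for a block file; it says nothing about actual disk behaviour.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical illustration only; the sink stands in for a block file.
class WriteBufferingSketch {

    /** Discards the data and counts how many write calls reach it. */
    static class CountingSink extends OutputStream {
        int writeCalls = 0;
        @Override public void write(int b) { writeCalls++; }
        @Override public void write(byte[] b, int off, int len) { writeCalls++; }
    }

    /**
     * Writes totalBytes in pieces of chunkSize and returns the number of
     * write calls seen by the sink. bufferSize <= 0 means no buffering.
     */
    static int countWrites(int totalBytes, int chunkSize, int bufferSize) {
        CountingSink sink = new CountingSink();
        byte[] chunk = new byte[chunkSize];
        try (OutputStream out = bufferSize > 0
                ? new BufferedOutputStream(sink, bufferSize) : sink) {
            for (int written = 0; written < totalBytes; written += chunkSize) {
                out.write(chunk, 0, chunkSize);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.writeCalls; // closing the buffered stream flushed any remainder
    }

    public static void main(String[] args) {
        int oneMb = 1024 * 1024;
        System.out.println("unbuffered: " + countWrites(oneMb, 512, 0));
        System.out.println("64K buffer: " + countWrites(oneMb, 512, 64 * 1024));
    }
}
```

The buffered run trades one extra in-memory copy per chunk for far fewer calls into the sink, which is the same trade-off discussed in this issue.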

Of course, extra buffering adds extra data copies. I will file another jira to remove the majority of these copies without changing the buffering.

Another change is that DataNode opens block file with RandomAccessFile() and seeks to first read position. It used to skip() to the position.
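A minimal sketch of the seek-based approach follows, assuming the block file is an ordinary local file. The class SeekSketch and its methods are hypothetical and are not copied from the patch; the variable names blockInFile and blockIn merely echo the ones mentioned later in this issue. One general reason to prefer seek() is that InputStream.skip() may skip fewer bytes than requested, so callers have to loop, while RandomAccessFile.seek() sets the position directly.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.util.Arrays;

// Hypothetical illustration only; not copied from HADOOP-1649.patch.
class SeekSketch {

    /** Reads up to len bytes starting at offset, using seek() instead of skip(). */
    static byte[] readFrom(File blockFile, long offset, int len, int bufferSize) {
        try (RandomAccessFile blockInFile = new RandomAccessFile(blockFile, "r")) {
            blockInFile.seek(offset); // jump straight to the first read position
            // The stream shares the file descriptor, so it starts reading at 'offset'.
            InputStream blockIn = new BufferedInputStream(
                    new FileInputStream(blockInFile.getFD()), bufferSize);
            byte[] out = new byte[len];
            int n = blockIn.readNBytes(out, 0, len);
            return Arrays.copyOf(out, n);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Helper for the demo: writes data to a temporary file and returns it. */
    static File writeTempFile(byte[] data) {
        try {
            File f = File.createTempFile("blk_sketch", ".dat");
            Files.write(f.toPath(), data);
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[2000];
        for (int i = 0; i < data.length; i++) { data[i] = (byte) i; }
        File f = writeTempFile(data);
        byte[] got = readFrom(f, 1024, 100, 4096);
        System.out.println("read " + got.length + " bytes starting at offset 1024");
        f.delete();
    }
}
```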


Konstantin Shvachko added a comment - 08/Aug/07 01:41 AM - edited
  1. SMALL_HDR_BUFFER_SIZE is not used any more.
  2. The "randomAccessFile" and "in" variables in DFSClient.sendBlock() are related, that is, they point to the same file.
    It would be good if they had something similar in their names, like blockFile and blockFileIn, etc.
  3. It does not look like we gain a lot with buffering, so let's make sure memory costs are minimized.
  4. FileUtil.checkDest() can also be removed, as it is private and never used locally.

Raghu Angadi added a comment - 08/Aug/07 04:10 AM
Thanks for the feedback, Konstantin.

(1) - done
(2) - done : renamed them. randomAccessFile becomes blockInFile and in becomes blockIn.
(4) - done (not really related to this Jira)

(3) : Could you elaborate? This Jira is about improved performance with buffering. Buffer for reading the file is less important than buffer for writing the file (on machines tested). But it still helps.


Konstantin Shvachko added a comment - 08/Aug/07 11:53 PM
+1
On 3: I mean that the performance gains are in single digits percentage-wise, so it is important to minimize memory costs. That is, I agree that you should file a new issue to deal with the redundant data copies.


dhruba borthakur added a comment - 10/Aug/07 06:20 PM
I just committed this. Thanks Raghu!