Issue Details (XML | Word | Printable)

Key: HADOOP-2758
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Raghu Angadi
Reporter: Raghu Angadi
Votes: 0
Watchers: 2
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

Reduce memory copies when data is read from DFS

Created: 31/Jan/08 09:24 PM   Updated: 08/Jul/09 04:42 PM
Component/s: None
Affects Version/s: None
Fix Version/s: 0.17.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works HADOOP-2758.patch 2008-02-29 02:38 AM Raghu Angadi 33 kB
Text File Licensed for inclusion in ASF works HADOOP-2758.patch 2008-02-26 10:22 PM Raghu Angadi 33 kB
Text File Licensed for inclusion in ASF works HADOOP-2758.patch 2008-02-26 03:56 AM Raghu Angadi 33 kB
Text File Licensed for inclusion in ASF works HADOOP-2758.patch 2008-02-22 09:30 PM Raghu Angadi 33 kB
Text File Licensed for inclusion in ASF works HADOOP-2758.patch 2008-02-19 10:20 PM Raghu Angadi 28 kB
Text File Licensed for inclusion in ASF works HADOOP-2758.patch 2008-02-13 10:36 PM Raghu Angadi 21 kB
Issue Links:
Reference
 
dependent
 

Release Note: DataNode takes 50% less CPU while serving data to clients.
Resolution Date: 04/Mar/08 07:11 PM


 Description  « Hide
Currently datanode and client part of DFS perform multiple copies of data on the 'read path' (i.e. path from storage on datanode to user buffer on the client). This jira reduces these copies by enhancing data read protocol and implementation of read on both datanode and the client. I will describe the changes in next comment.

Requirement is that this fix should reduce CPU used and should not cause regression in any benchmarks. It might not improve the benchmarks since most benchmarks are not cpu bound.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Raghu Angadi added a comment - 13/Feb/08 07:05 AM
With a prelimirary patch that removes extra copies on datanode while reading a block, the results are promising.
Ran 4 instances of 'dfs -cat 5GbFile > /dev/null' similar to the tests in HADOOP-2144. All the blocks are local.

branch-0.16 : ~4 min. cpu bound. user cpu is 3 times the kernel cpu.
trunk + patch : ~3min. disk bound. user cpu is 2 times the kernel cpu. not that much of cpu was left (~10-20%).

Also from HADOOP-2144, datanode cpu is around 0.9 times DFSClient cpu. Even after ignoring idle cpu in the second test, datanode takes less than half of cpu with the patch. This includes both user and kernel cpu taken by datanode. Assuming kernel cpu is same in both cases, the user cpu taken by datanode in second test would much less than half (may be closer 1/3rd).


Raghu Angadi added a comment - 13/Feb/08 10:36 PM

Attached patch removes extra buffer copies when data is read from the data node (by client or while replicating).

  • before : disk --> large bufferedinputstream --> small datanode buffer --> large bufferedoutputstream --> socket.
  • after : disk --> large datanode buffer --> socket
  • each arrow represents a memory copy. cost of arrows at the ends is share between user and kernel, I think (using direct buffer might further reduce that, will try.).

I will post more microbenchmarks similar to last comment.

We can reduce one copy on the DFSClient. Current readChunk() interface in FSInputChecker does not allow it. We could add optional readChunks() so that an implementation can get access to user's complete buffer. There will be a default implementation of this. Should I file a jira?

This patch changes the DATA_TRANSFER_PROTOCOL a bit.

Currently there are no improvements in buffering whilre writing data to DFS. I will do that in a follow up jira.

All the unit tests pass. I will run them on windows as well. No new tests are added since this does not actually change any functionality and purely a performance improvement.


Raghu Angadi added a comment - 13/Feb/08 10:39 PM
Also this is friendly to Java's FileChannel.transferTo() interface. That can reduce one more memory copy. This will be good experiment.

Raghu Angadi added a comment - 14/Feb/08 10:37 PM
With 6 instances of 'dfs -cat 5GbFile > /dev/null' both tests were cpu bound.

with out the patch : ~ 6.5 min
with the patch : ~ 3.5 min.


Konstantin Shvachko added a comment - 15/Feb/08 02:43 AM
  1. Could you please add comments about the format of the send/receive protocol for blocks.
    We do not have formal definition of the protocol, and it is very hard to understand the code
    without knowing the general structure of the packet and the bytes and nits it is composed of.
  2. Do we still need the notion of a chunk? Previously it corresponded to 512 data bytes prefixed
    by a 4 byte checksum. Now it is just a sequence if checksums followed by data.
    I think any mentioning of chunks should be either removed or replaced by something else in order
    to avoid confusion with the previous version.
  3. Too many (20 or so) members in BlockSender and BlockReceiver.
    • Some of them can be made local variables
      checksumSize
    • some are used merely to pass parameters between or into internal methods
      BlockSender.throttler
      BlockSender.out
      BlockSender.corruptChecksumOk
      BlockSender.chunkOffsetOK
    • and some are derivatives (computable) from other variables.
    • Please check the BlockReceiver as well.
    • Most of them are not introduced by this patch, but it'd be great if you could work on
      that to make the code more understandable. Right now it is in incomprehensible state.
  4. Constants should be named in all capital letterers.
    static final int packetHeaderLen =
  5. DATA_TRANFER_VERSION
    • Spelling TRANSFER
    • Please leave javadoc-style comments in place.
    • Please remove previous version descriptions.
    • I generally do not understand what is the meaning of this constant, introduced btw in HADOOP-1134.
      If DatanodeInfo serialization changes then both ClientProtocol and DatanodeProtocol versions should be
      incremented. Why do we need extra version?
      Only the last version changes should be described, the old ones could be retrieved from svn.
  6. enumerateThreadGroup() is not used locally.
  7. Spelling: should be "stream"
    private InputStream blockIn; // data strean

Sorry, I did not get to the essentials of your patch. But it is really hard to understand without
knowing the details of the protocol.


Raghu Angadi added a comment - 15/Feb/08 03:35 AM - edited
Thanks for the detailed review Konstatin.

Just a clarification to the readers: except 4, rest are not introduced by this patch. I will try to fix some of them.

When this protocol was first introduced, it was described in the jira (HADOOP-1134). Since then, it has evolved. I agree that it will be better to have the ascii diagrams in the code comments itself. I will surely add that.


Raghu Angadi added a comment - 15/Feb/08 06:18 PM
Note regd 'dfs -cat' numbers: These are end to end tests and numbers vary depending on how many instances we run. Just as in any end-to-end test there are multiple factors that are not affected by this patch. This patch reduces CPU consumed by DataNode while serving data. It cannot be directly comapred from 'dfs -cat' numbers.

I have semi-directly calculated DataNode with the patch takes 35-45% of CPU it used to take before. This calculation uses 9/10 ratio from HADOOP-2144. 'top' on my dev box truncates summed up cpu to 99.9 (unlike on the machine used in HADOOP-2144), other wise we could directly compare CPU taken by DataNode instead of calculating it indirectly.

Sameer asked me to compare single instance of 'dfs -cat' and regular shell 'cat'. I will add those numbers in the next comment.


Raghu Angadi added a comment - 15/Feb/08 07:09 PM - edited
Comparision of single instance of 'dfs -cat 5Gbfile > /dev/null" with 'cat 5Gbfile > /dev/null'. All the data resides locally on a 4 disk RAID0 partition :
min:sec cat dfs -cat with 0.16 dfs -cat with the patch
run 1 2:40 3:44 3:24
run 2 2:56 3:05 3:51
run 3 3:01 3:18 2:51

What would you conclude? Both of the obvious conclusions are incorrect :

  1. dfs -cat is almost as good as simple cat.
  2. this patch does not help much.

If we had a single disk partition, the numbers would be even closer.


Raghu Angadi added a comment - 15/Feb/08 07:26 PM
Regd couple of concerns in Konstantin's review :
  • > 2. Do we still need the notion of a chunk? [...]
    • I think so. A CRC chunk is still central to many things that DataNode and DFSClients do. It is very useful for discussions, descriptions and even in code to have a single word to consistently describe this essential unit of DFS data. If we see a member called 'sendChunk()', its clear what it sends. For e.g. this patch renamed sendChunk() to sendChunks(int) because it sends multiple CRC chunks.
  • > 5. DATA_TRANSFER_VERSION : I generally do not understand what is the meaning of this constant, [...]
    • data transfers do not use RPCs. As noted in the comment, it unfortunately does depend on Datanode serializations. Probably it should not. This is analogous RPC versions and a Protocol version, which are at two different levels of the stack.

dhruba borthakur added a comment - 15/Feb/08 07:56 PM
The most recent version of the Data Transfer Protocol can be found as an attachment to HADOOP-1707. It might be worthwhile to update this document with the changes proposed in this JIRA.

Raghu Angadi added a comment - 15/Feb/08 07:59 PM
Dhruba,

As Konstantin suggested, it is an internal protocol and probably best to keep the protocol documentation in (large) code comments. Its harder to go back HADOOP-1707 to HADOOP-1134 to extract the relevant information.


Runping Qi added a comment - 15/Feb/08 09:29 PM

I expect you will see big difference if you run 4 concurrent cats, one for a file on different disk, vs. 4 concurrent dfs cats.


Konstantin Shvachko added a comment - 16/Feb/08 01:54 AM
I understand now DATA_TRANSFER_VERSION is a part of Data Transfer Protocol.
> sendChunks(int) ... sends multiple CRC chunks.
I would rather think of it that it is sending portions of data along with corresponding CRCs, rather than CRCs with the data attached.

Raghu Angadi added a comment - 19/Feb/08 10:20 PM
Konstantin, updated patch add all suggested changes except #3 above (reducing the number of member variables). I think it is a large enough change to have its own Jira.

The essence of the patch is at the top of second comment (difference between "before" and "after"). Please feel free to ping me regd explanation of any changes.


Raghu Angadi added a comment - 22/Feb/08 09:30 PM
More changes under src/test caused by fixing DATA_TRANFER_VERSION to DATA_TRANSFER_VERSION.

Konstantin Shvachko added a comment - 26/Feb/08 02:59 AM
  1. The comments are very useful, thanks.
  2. "4 byte Legth of actual data" should read Le_n_gth.
  3. BlockSender
    • sendBuf member should be a local variable of sendChunks().
      We always call the constructor BlockSender() followed by only one call to sendChunks().
      So there is no advantage of allocating the buffer in the constructor and then using it in sendChunks().
    • Same with maxChunksPerPacket
  4. +1 other than that

Raghu Angadi added a comment - 26/Feb/08 03:56 AM
Thank Konstantin. Update patch includes all the changes.

regd sendBuf and maxChunksPerPacket: these are made local to sendBlock(), I think thats what you meant (instead of sendChunks(), which invoked in sendBlock() in a loop).

Will make this patch available once DFSIO benchmark shows the buffering is same as before.


Raghu Angadi added a comment - 26/Feb/08 10:22 PM
minor variable changes.

Hadoop QA added a comment - 27/Feb/08 07:38 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12376557/HADOOP-2758.patch
against trunk revision 619744.

@author +1. The patch does not contain any @author tags.

tests included +1. The patch appears to include 6 new or modified tests.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new javac compiler warnings.

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs -1. The patch appears to introduce 1 new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1849/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1849/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1849/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1849/console

This message is automatically generated.


Raghu Angadi added a comment - 29/Feb/08 02:38 AM

fixed findbugs warning.

Many thanks to Mukund for running benchmarks with the patch. The results look fine and we are going to double check with another run. We want make sure 'effective buffering' hasn't changed.


Hadoop QA added a comment - 29/Feb/08 06:27 AM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12376788/HADOOP-2758.patch
against trunk revision 619744.

@author +1. The patch does not contain any @author tags.

tests included +1. The patch appears to include 6 new or modified tests.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new javac compiler warnings.

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs +1. The patch does not introduce any new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1873/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1873/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1873/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1873/console

This message is automatically generated.


Raghu Angadi added a comment - 01/Mar/08 04:13 AM - edited
Edit : typos

More comparisions. I hope this shows the improvements.

Test : Run 6 instances of 'cat 5GbFile > /dev/null' using a single node cluster. All the blocks are located on local disks (RAID0 I think).

The hdfs tests include constant costs : Client cpu and kernel cpu not on behalf of user processes. Client cpu is at least as much as DataNodes. This implies, 25% improvement in wall clock time implies more that 50% improvement in DataNode cpu.

Test Bound By Run1 Run2 Run3 Avg Percentage Note
Trunk CPU 355 332 346 344 100%  
Trunk + patch CPU 225 213 228 222 65%  
cat command Disk IO 185 83 105 124 36% Not really comparable

Even 21 instances of 'cat allBlocksForFile > /dev/null' was not CPU bound. 'cat' takes virtually zero cpu in user space.


Konstantin Shvachko added a comment - 03/Mar/08 09:29 PM
+1 test results look encouraging.

Hudson added a comment - 04/Mar/08 01:10 PM

Raghu Angadi added a comment - 04/Mar/08 07:11 PM
I committed this yesterday.

Hudson added a comment - 12/Mar/08 12:18 PM