Issue Details (XML | Word | Printable)

Key: HADOOP-1702
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Raghu Angadi
Reporter: Raghu Angadi
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

Reduce buffer copies when data is written to DFS

Created: 09/Aug/07 09:43 PM   Updated: 08/Jul/09 04:42 PM
Component/s: None
Affects Version/s: 0.14.0
Fix Version/s: 0.18.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works HADOOP-1702.patch 2008-05-12 09:45 PM Raghu Angadi 42 kB
Text File Licensed for inclusion in ASF works HADOOP-1702.patch 2008-05-08 07:31 PM Raghu Angadi 42 kB
Text File Licensed for inclusion in ASF works HADOOP-1702.patch 2008-05-08 07:26 PM Raghu Angadi 41 kB
Text File Licensed for inclusion in ASF works HADOOP-1702.patch 2008-05-07 12:16 AM Raghu Angadi 41 kB
Text File Licensed for inclusion in ASF works HADOOP-1702.patch 2008-04-25 01:05 AM Raghu Angadi 39 kB
Text File Licensed for inclusion in ASF works HADOOP-1702.patch 2008-04-17 03:11 PM Raghu Angadi 39 kB
Text File Licensed for inclusion in ASF works HADOOP-1702.patch 2008-04-17 02:55 PM Raghu Angadi 39 kB
Text File Licensed for inclusion in ASF works HADOOP-1702.patch 2008-04-16 10:20 PM Raghu Angadi 39 kB
Text File Licensed for inclusion in ASF works HADOOP-1702.patch 2008-02-23 01:19 AM Raghu Angadi 34 kB
Issue Links:
Reference
 
dependent
 

Hadoop Flags: Reviewed, Incompatible change
Release Note: Reduced buffer copies as data is written to HDFS. The order of sending data bytes and control information has changed, but this will not be observed by client applications.
Resolution Date: 14/May/08 06:35 AM


 Description  « Hide
HADOOP-1649 adds extra buffering to improve write performance. The following diagram shows buffers as pointed by (numbers). Each eatra buffer adds an extra copy since most of our read()/write()s match the io.bytes.per.checksum, which is much smaller than buffer size.
       (1)                 (2)          (3)                 (5)
   +---||----[ CLIENT ]---||----<>-----||---[ DATANODE ]---||--<>-> to Mirror  
   | (buffer)                  (socket)           |  (4)
   |                                              +--||--+
 =====                                                    |
 =====                                                  =====
 (disk)                                                 =====

Currently loops that read and write block data, handle one checksum chunk at a time. By reading multiple chunks at a time, we can remove buffers (1), (2), (3), and (5).

Similarly some copies can be reduced when clients read data from the DFS.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Raghu Angadi added a comment - 13/Feb/08 11:16 PM
The protocol and code has changed quite a bit around write path, but memory copies essentially remain the same as in the picture above.

Raghu Angadi added a comment - 23/Feb/08 01:19 AM - edited
The attached patch reduces buffer copies on DataNode. DataNode read one
packet and writes it to mirror and the local disk. Each packet contains
io.file.buffer.size of data.

Below: 'b.i.s' is BufferedInputStream and 'b.o.s' is BufferedOutputStream, and these are
are much larger than typical size of read or write to them.

Each ---> represents a memory copy.

  • Datanode :
    • before :
                                                           + ---> b.o.s ---> mirror socket
                                                           |
        client socket ---> b.i.s ---> small DataNode buf --|
                                                           |
                                                           + ---> b.o.s ---> local disk 
        
    • after :
                                                         + ---> mirror socket
                                                         |
        client socket ---> Large Datanode buf (Packet) --|
                                                         |
                                                         + ---> local disk 
        
  • Client :
    • before: Client used 64k packets irrespective of io.file.buffer.size.
      So the extra copy for b.o.s was not present if io.file.buffer.size at or
      below 64k. But each packet required 2 writes.
    • after : the size of packet is based on io.file.buffer.size and we use one
      write to write to datanode socket.

I don't have numbers regd cpu savings. In absolute numbers, for given amount of data on one DataNode, CPU saved on DataNode should be larger than CPU saved when same amount is read with HADOOP-2758.

DFSIO benchmark numbers have been very sensitive to buffering (not to CPU) while writing, we need to show this patch does not negatively affect this benchmark.


Raghu Angadi added a comment - 23/Feb/08 01:21 AM
The attached patch applies after applying the patch from HADOOP-2758.

Raghu Angadi added a comment - 06/Mar/08 04:04 AM - edited
Test results show 30% improvement in DataNode CPU with the patch. I think it makes sense. Based on the picture above before this patch, with replication of 3, the data is copied 6 + 6 + 4 times and with this patch it is 3 + 3 + 2. Each of these datanodes verify CRC. Approximating cost of checksumming to be twice that of a memory copy, we get (8+6)/(14+6) == 70%. If we increase the size of checksum chunk, cost of CRC will go down. It is be 68% with a factor of 1.5 for CRC.

Test Setup : three instances of 'dd if=/dev/zero 4Gb | hadoop -put - 4Gb'. More importantly, DataNode was modified to write the deta to '/dev/null' instead of the block file. Otherwise I could not isolate the test from disk activity. The cluster has 3 datanodes. The clients, Namenode, and datanodes are all running on the same node. The test was CPU bound.

CPU measurement : Linux reports a process' cpu in /proc/pid/stat : 14th entry is user cpu and 15th is kernel cpu. I think these are specified in jiffies. Like most things with Linux kernel, these are approximations but reasonably dependable in large numbers. The numbers reported are sum of cpu for each of the three datanodes.

below: 'u' and 'k' are user and kernel cpu in thousands of jiffies.

Test Run 1 Run 2 Run 3 Avg Total Cpu Avg Time
Trunk* 8.60u 2.52k 372s 8.36u 2.48k 368s 8.39u 2.40k 368s 10.95 369s
Trunk + patch* 5.61u 2.22k 289s 5.38u 2.16k 296s 5.57u 2.25k 289s 7.73 (70%) 291s (79%)

* : datanodes write data to /dev/null.

Currently, DFSIO benchmark shows dip in write b/w. I am still looking into it.


Raghu Angadi added a comment - 03/Apr/08 02:43 AM
The dip in DFSIO benchmark turned out to be because of the fact that DFSIO creates files with a buffersize of 1000000!. The buffersize passed while creating file is passes on to FileSystem implementation (DFSClient in this case). This brings up the question on how an implementation can treat user specified buffersize. Can increasing buffersize (as in this case) reduce performance, i.e. should an implementation allow it?

This is what happens on trunk:

  • user specified buffesize is effectively does not matter on trunk.
  • Client sends buffers up packets of 64k size and flushes them after the pkt if full. There could at most 10 such packets in the pipeline at a time.. usually much less.
  • DataNodes use io.file.buffer.size for their streams.

With the patch here :

  • user specified bufferesize sets the packet size.
  • at DataNodes, packet size dictates write size for mirror stream and local file (i.e. it io.file.buffer.size does not matter).
  • The rest is same.

Another proposal :

  • packetSize = Min( 64k, buffersize );
  • Max # packets in pipeline = Max(buffersize/packetSize, 10)

'64k' here could be made an configurable (may be "dfs.write.packet.size") so that different 'real' buffer sizes could be used for experimentation.

How does the above proposal sound?


dhruba borthakur added a comment - 07/Apr/08 06:49 AM
I would vote for keeping the packetSize fixed at 64K and not make it dependent on any user defined configuration parameter.

Raghu Angadi added a comment - 07/Apr/08 03:56 PM
Whats wrong with a internal variable with default of 64k? How do we know 64k is the best for all platforms, though its a pretty good value?

dhruba borthakur added a comment - 07/Apr/08 05:20 PM
We could certainly fetch the value from the conf, my point was not to insert this configuration parameter in the hadoop-defaults.xml file. Do you agree?

Raghu Angadi added a comment - 07/Apr/08 05:34 PM
Yeah. This won't be in hadoop-defaults.xml. It will be an internal variable.

Raghu Angadi added a comment - 16/Apr/08 10:20 PM
Patch updated for trunk.

Raghu Angadi added a comment - 25/Apr/08 01:05 AM
Patch updated for trunk.

Hairong Kuang added a comment - 06/May/08 11:13 PM
A few initial comments:
1. Once Packet.getBuffer() is called, no more data can be written to packet. This is not obvious to code readers. Better add this restriction to the comment.
2. packetSize/writePacketSize in DFSClient don't include the size of the packet header. I think it is better to rename them to packetPayloadSize/writePacketPayloadSize.
3. The packet size guess calculation in DataNode should match the calculation in DFSClient.

Raghu Angadi added a comment - 07/May/08 12:16 AM
Thanks Hairong. Attached patch fixes the above.
Regd #2, changed packetSize to include the header rather than changing its name.

Hairong Kuang added a comment - 08/May/08 05:48 PM
A few more comments:
1. Currently DFSClient moves the checksum down to fill the gap when a partial packet needs to be sent. To avoid the copy, we could instead store the checksums in the reversed order starting from the dataStart.
2. Datanode does not verify checksums before a packet is sent to the down stream datanode in the pipeline. This is a change from the current trunk's behavior.
3. In Datanode.BlockReceiver.readnextPacket, it is clearer to change the variable "pktLen" to be payloadLen.
4. Datanode.BlockReceiver.checksumOut does not need a big buffer. SMALL_BUFFER_SIZE should do.

Tsz Wo (Nicholas), SZE added a comment - 08/May/08 06:07 PM
> 1. Currently DFSClient moves the checksum down to fill the gap when a partial packet needs to be sent. To avoid the copy, we could instead store the checksums in the reversed order starting from the dataStart.

Or, is it good to interleave data and checksums?


Raghu Angadi added a comment - 08/May/08 06:15 PM
Thanks for the review Hairong.
  • #1 : interesting suggestion. It would be a change in protocol that affects other datanode transfers like reading etc. We rarely send partial packets (mainly fsync). Also checksum data is less than one percent of the packet. I hope it is ok for this patch. Note that this is not a memory copy that did not exist before (previously all the data was copied).
  • #2: Yes it is a change from previous behavior. Before this patch it didn't matter since we handled 512 bytes at a time. The receiving datanode verifies the checksum anyway. Checking checksum after tunneling data downstream (theoretically) reduces latency. This is the same reason datanode first sends the data to mirror and then stores the data locally.
  • #3. sure.
  • #4. yes.

Raghu Angadi added a comment - 08/May/08 06:17 PM
> Or, is it good to interleave data and checksums?
. The main purpose of this and some more patches is not to interleave data and checksums.

Raghu Angadi added a comment - 08/May/08 07:26 PM
Updated patch has changes related to Hairong's review.

Raghu Angadi added a comment - 08/May/08 07:31 PM
DATA_TRANSFER_VERSION is incremented.

Hadoop QA added a comment - 09/May/08 02:55 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12381710/HADOOP-1702.patch
against trunk revision 654315.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 3 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

-1 findbugs. The patch appears to introduce 2 new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2432/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2432/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2432/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2432/console

This message is automatically generated.


Hairong Kuang added a comment - 09/May/08 07:13 PM
+1. The patch looks good.

Raghu Angadi added a comment - 12/May/08 09:45 PM
Fixed findbugs warnings and ran 'ant patch'.

Hadoop QA added a comment - 12/May/08 11:04 PM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12381909/HADOOP-1702.patch
against trunk revision 655593.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 3 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2452/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2452/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2452/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2452/console

This message is automatically generated.


Raghu Angadi added a comment - 14/May/08 06:35 AM
I just committed this.

Hudson added a comment - 14/May/08 12:43 PM