Issue Details (XML | Word | Printable)

Key: HADOOP-2154
Type: Improvement Improvement
Status: Resolved Resolved
Resolution: Fixed
Priority: Major Major
Assignee: Raghu Angadi
Reporter: Konstantin Shvachko
Votes: 0
Watchers: 3
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

Non-interleaved checksums would optimize block transfers.

Created: 06/Nov/07 01:54 AM   Updated: 08/Jul/09 04:42 PM
Return to search
Component/s: None
Affects Version/s: 0.14.0
Fix Version/s: 0.18.0

Time Tracking:
Not Specified

Issue Links:
Reference
 

Resolution Date: 14/May/08 06:39 AM


 Description  « Hide
Currently when a block is transfered to a data-node the client interleaves data chunks with the respective checksums.
This requires creating an extra copy of the original data in a new buffer interleaved with the crcs.
We can avoid extra copying if the data and the crc are fed to the socket one after another.

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Konstantin Shvachko made changes - 06/Nov/07 01:55 AM
Field Original Value New Value
Link This issue is related to HADOOP-1702 [ HADOOP-1702 ]
Rajagopal Natarajan made changes - 22/Nov/07 04:25 PM
Assignee Rajagopal Natarajan [ nrajagopal ]
Rajagopal Natarajan added a comment - 26/Nov/07 07:07 AM
Before I start writing the code, I just have a query about this improvement. As I could see, maxbytes per checksum is typically 512 bytes by default. By default, block size is 67108864 bytes. This means that if the chunks would be written to sockets directly (as opposed to accumulating chunks and writing out once a full block is ready) with its checksum, we are decreasing the data:header ratio by a big factor, isn't it? Wouldn't this be inefficient? Or am I missing something?

Raghu Angadi added a comment - 26/Nov/07 05:27 PM
In my initial implementation of HADOOP-1134, I did not keep buffers between socket and datanode (reader and writer). Looks like this jira proposes that. Note that I had to put the buffers back since there was a regression on DFSIO benchmarks and sort. Pretty much none of our benchmarks is cpu intensive on Datanodes.

If we want to get rid of extra buffer copies, I would either look in to one these two :

  1. reorganize the while loop so that there is one extra copy (from disk to user buffer) and not two. i.e. large user buffer directly written to socket (in the case of block read).
  2. Remove both copies by extending the protocol to allow one DATA_CHUNK to allow multiple CHECKSUM chunks. e.g. one DATA_CHUNK would contain 64k worth of block data directly to user buffer and 65k*4/512 checksum bytes at the end. So that Datanode directly reads in to large user buffer and that buffer is written to socket (basically bringing buffer handling back to pre HADOOP-1134).
  3. Using multiple sockets is another option but I am not a fan of it.

eric baldeschwieler added a comment - 29/Nov/07 07:26 AM
+1

We should not be doing these copies and interleaves if we can avoid them.
A lot of change here, but if we could move to a protocol where the client requests a buffer of bytes, rather than just pushing bytes, we could start the response with a CRCs list, followed by the bytes. This would require less RAM on the client side (I think).

Can we just memory map the block and then copy the requested chunk it directly to the socket or use other tricks to reduce copies further? (I'm NIO naive)


eric baldeschwieler added a comment - 29/Nov/07 05:04 PM
Still +1, but apologies. I was thinking about the read case when I wrote the comment, I think what raghu stated makes more sense without my additions.

Konstantin Shvachko added a comment - 29/Nov/07 07:32 PM - edited
Rajagopal, I do not see how the data:header ratio is decreasing here.

This issue is mainly about removing the interleaving buffer layout. Namely, now we partition the original data into chunks,
calculate crc for each chunk and create the following buffer, which subsequently is transferred to a data-node:

data chunk 1 crc for data chunk 1 data chunk 2 crc for data chunk 2 ... data chunk n crc for data chunk n

I propose to change it [back] to

the original data (not partitioned into chunks) crcs for the original data

If you add a header before each data and crc chunk then in current approach you will have 2*n headers, while in the proposed
approach there will be only 2. So the data:header ratio will increase: (|data| + |crc|) / 2n < (|data| + |crc|) / 2

This should let us get rid of that extra buffer that is used to collect all the interleaved pieces together.

And thus the issue is not about "writing the chunks to the socket directly", but rather about removing chunks all together.
Imo, this is related to both reads and writes. May be reads and writes should even share this code.
Removing other redundant buffers is a part of a different issue.

Eric, why do you think transferring crc before the data would require less RAM on the client?
If it does then it definitely makes sense to send crcs before the data bytes.


Doug Cutting added a comment - 29/Nov/07 07:51 PM
I think you meant:
the original data (not partitioned into chunks) crcs for for the original data

I.e., there will be a crc for each chunk in the original data, but the original data will not be broken into chunks. Is that right?


Raghu Angadi added a comment - 29/Nov/07 08:07 PM - edited
Also "the original data (not partitioned into chunks)" does not mean full 128MB right? It is upto something like 64kB or what ever the io buffersize is...
Edit: This is what I mean by 2nd option in 2nd comment above.

Konstantin Shvachko added a comment - 29/Nov/07 09:37 PM
Yes on both comments.

eric baldeschwieler added a comment - 29/Nov/07 09:53 PM
I'm not sure order actually matters. I can think of arguments for either.

Rajagopal Natarajan added a comment - 30/Nov/07 04:26 AM
@Konstantin
Apologies. I had misunderstood your proposal, that I thought it is to avoid the use of backupstream and write each chunk-crc pair to socket directly instead of buffering. Now I understand what you meant.

Nigel Daley made changes - 22/Jan/08 07:32 PM
Fix Version/s 0.16.0 [ 12312740 ]
Raghu Angadi added a comment - 29/Jan/08 10:19 PM
Rajgopal, I am planning to implement this. Let me know if you have already made progress. My approach to to have multiple "checksum chunks" in each DATA_CHUNK as discussed in multiple comments above.

Raghu Angadi added a comment - 29/Jan/08 10:27 PM
Also I am planning to do read and write side as separate patches. The read side should help with HADOOP-2144.

Rajagopal Natarajan added a comment - 30/Jan/08 08:17 AM
Hi Raghu, Please go ahead with it. I haven't progressed much. Only started with it, and couldn't get much time after that.

Rajagopal Natarajan made changes - 30/Jan/08 08:18 AM
Assignee Rajagopal Natarajan [ nrajagopal ] Raghu Angadi [ rangadi ]
eric baldeschwieler added a comment - 31/Jan/08 06:35 AM
Hi Raghu,

I hear we were planning to do some rework of the read protocols in 17 to make them similar to the new write protocols. Correct? I would think we would want to coordinate this work with that. This would imply that per client read we would ship the CRCs for all requested bytes followed by all requested bytes for any given client request, right (or data/crc)?

It's not clear what you are referring to by buffer in your comment. In the final protocol, I don't think the data node should do any CRC interleaving per client request, do you?

Is some of this discussed in another jira? Could you provide a reference if so?


Raghu Angadi made changes - 31/Jan/08 09:27 PM
Link This issue is related to HADOOP-2758 [ HADOOP-2758 ]
Raghu Angadi added a comment - 31/Jan/08 09:32 PM
I chatted with Eric and Konstantin. I think we are on the same page now. Essentially this jira will reduce memory (buffer) copies by rearranging how data is read/written to/from sockets/disk.

I filed a follow up jira HADOOP-2758 for DFS read side. Read path is much simpler and I can work on write part while we are evaluating good or bad affects of HADOOP-2758.


Raghu Angadi added a comment - 14/May/08 06:39 AM
HADOOP-1702 is also committed. Now the data is not transfered interleaved.

Raghu Angadi made changes - 14/May/08 06:39 AM
Resolution Fixed [ 1 ]
Status Open [ 1 ] Resolved [ 5 ]
Doug Cutting made changes - 16/Sep/08 05:32 PM
Fix Version/s 0.18.0 [ 12312972 ]
Owen O'Malley made changes - 08/Jul/09 04:42 PM
Component/s dfs [ 12310710 ]