Issue Details

Key: HADOOP-3515
Type: Improvement
Status: Closed
Resolution: Won't Fix
Priority: Major
Assignee: dhruba borthakur
Reporter: dhruba borthakur
Votes: 0
Watchers: 3
Hadoop Common

Protocol changes to allow appending to the last partial crc chunk of a file

Created: 08/Jun/08 11:16 PM   Updated: 08/Jul/09 04:43 PM
Component/s: None
Affects Version/s: None
Fix Version/s: None

Time Tracking:
Not Specified

Issue Links:
Blocker
 

Resolution Date: 25/Jul/08 06:16 PM


Description
To support "appending" to an existing file, we need the ability to append data to the last partial crc chunk of the file.

Comments
dhruba borthakur made changes - 08/Jun/08 11:16 PM
Field Original Value New Value
Link This issue blocks HADOOP-1700 [ HADOOP-1700 ]
dhruba borthakur added a comment - 08/Jun/08 11:16 PM

We have two approaches:

1. The client is unaware of how much data can go into the pre-existing last crc chunk. The client buffers (as usual) all new data written by the application; when a crc chunk is full, it sends it to the datanode(s). The datanode(s) know that part of this newly arriving chunk has to be appended to the last partial crc chunk that already exists on disk. The datanode reads the last partial crc chunk from disk, appends as much of the new data as fits into that crc chunk, and writes the crc chunk back. This logic needs to be executed only by the primary (first) datanode in the pipeline.

The advantage of this approach is that multiple appenders can be supported in future. The disadvantage of this approach is that the crc has to be computed once by the client and again by the primary datanode.
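A minimal sketch of the datanode-side step in approach 1, to make the "read, top up, write back" logic concrete. The class and method names (`PartialChunkAppend`, `fillLastChunk`) are hypothetical and are not HDFS code; `java.util.zip.CRC32` is used as a stand-in checksum, and disk I/O is replaced by byte arrays.

```java
import java.util.Arrays;
import java.util.zip.CRC32;

/** Hypothetical illustration of approach 1; not the actual datanode code. */
final class PartialChunkAppend {

    /** Outcome of topping up the partial crc chunk that was on disk. */
    static final class Result {
        final byte[] chunk;  // contents of the last chunk after the merge
        final long crc;      // CRC32 recomputed over the merged chunk
        final int consumed;  // number of incoming bytes absorbed into the chunk

        Result(byte[] chunk, long crc, int consumed) {
            this.chunk = chunk;
            this.crc = crc;
            this.consumed = consumed;
        }
    }

    /** CRC32 over data[off, off + len). */
    static long crcOf(byte[] data, int off, int len) {
        CRC32 c = new CRC32();
        c.update(data, off, len);
        return c.getValue();
    }

    /**
     * Appends as much of {@code incoming} as fits into {@code partial}
     * (the last crc chunk already on disk) and recomputes its crc.
     * The caller would re-chunk the remaining incoming bytes.
     */
    static Result fillLastChunk(byte[] partial, byte[] incoming, int bytesPerChecksum) {
        if (partial.length > bytesPerChecksum) {
            throw new IllegalArgumentException("partial chunk larger than crc chunk size");
        }
        int room = bytesPerChecksum - partial.length;
        int consumed = Math.min(room, incoming.length);
        byte[] merged = Arrays.copyOf(partial, partial.length + consumed);
        System.arraycopy(incoming, 0, merged, partial.length, consumed);
        return new Result(merged, crcOf(merged, 0, merged.length), consumed);
    }
}
```

The crc sent by the client for the incoming chunk cannot be reused here, because the chunk boundaries have shifted; this is the second crc computation that the disadvantage above refers to.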

2. In the second approach, the client fetches the contents of the last crc chunk from the datanode (and buffers it) when the file is first opened for append. It then appends newly written data to this buffered chunk. When the chunk is full, it sends it to the datanode pipeline.

The advantage of this approach is that crcs do not need to be generated in two places; they are generated only by the client. The disadvantage of this approach is that supporting multiple concurrent appenders becomes infeasible.
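A rough sketch of the client-side buffering in approach 2, under the assumption that the last partial chunk has already been fetched from the datanode. The class `AppendClientBuffer` and its members are hypothetical; the lists stand in for the datanode pipeline, and `java.util.zip.CRC32` is a stand-in checksum.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.CRC32;

/** Hypothetical illustration of approach 2; not the actual DFS client code. */
final class AppendClientBuffer {
    private final byte[] buf;  // one crc chunk worth of data
    private int filled;        // bytes currently held in buf

    // Stand-ins for the datanode pipeline: full chunks and their crcs.
    final List<byte[]> sentChunks = new ArrayList<>();
    final List<Long> sentCrcs = new ArrayList<>();

    /** Seeds the buffer with the last partial chunk fetched at open-for-append. */
    AppendClientBuffer(byte[] lastPartialChunk, int bytesPerChecksum) {
        if (lastPartialChunk.length >= bytesPerChecksum) {
            throw new IllegalArgumentException("expected a partial chunk");
        }
        buf = new byte[bytesPerChecksum];
        System.arraycopy(lastPartialChunk, 0, buf, 0, lastPartialChunk.length);
        filled = lastPartialChunk.length;
    }

    /** Appends application data, emitting a chunk each time the buffer fills. */
    void write(byte[] data) {
        int off = 0;
        while (off < data.length) {
            int n = Math.min(buf.length - filled, data.length - off);
            System.arraycopy(data, off, buf, filled, n);
            filled += n;
            off += n;
            if (filled == buf.length) {
                sendChunk();
            }
        }
    }

    /** Computes the crc once, on the client, and hands the chunk to the pipeline. */
    private void sendChunk() {
        CRC32 crc = new CRC32();
        crc.update(buf, 0, filled);
        sentChunks.add(Arrays.copyOf(buf, filled));
        sentCrcs.add(crc.getValue());
        filled = 0;
    }

    /** Bytes buffered but not yet sent. */
    int buffered() {
        return filled;
    }
}
```

Because the buffer is seeded from a snapshot of the last chunk, a second client appending at the same time would hold a stale copy, which illustrates why concurrent appenders do not fit this approach.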


Tsz Wo (Nicholas), SZE added a comment - 09/Jun/08 10:22 PM
I think we could have a third option.

3. The client knows the original file size. It can add padding to the first chunk, so that the crc boundary is aligned with the original block. Then the primary datanode only has to compute the crc for the first chunk, but not for the chunks after it.

For example, suppose the original file size is 3123 bytes, the block size is 2000 bytes and the crc chunk size is 500 bytes. The last block originally has 1123 bytes. The client will add 123 bytes of prefix padding to the first chunk.
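The arithmetic in this example can be written out as a small sketch. The class and method names are hypothetical; the sketch only reproduces the calculation from the comment above.

```java
/** Hypothetical helper reproducing the padding arithmetic of option 3. */
final class AppendPadding {

    /** Number of bytes stored in the last block of the file. */
    static long bytesInLastBlock(long fileSize, long blockSize) {
        long r = fileSize % blockSize;
        // A non-empty file whose size is an exact multiple ends with a full block.
        return (r == 0 && fileSize > 0) ? blockSize : r;
    }

    /** Bytes already in the last partial crc chunk, i.e. the prefix padding. */
    static long prefixPadding(long fileSize, long blockSize, long bytesPerChecksum) {
        return bytesInLastBlock(fileSize, blockSize) % bytesPerChecksum;
    }

    public static void main(String[] args) {
        // 3123 % 2000 = 1123 bytes in the last block; 1123 % 500 = 123.
        System.out.println(prefixPadding(3123, 2000, 500)); // prints 123
    }
}
```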


dhruba borthakur added a comment - 17/Jun/08 06:54 PM
Hi Nicholas, for the third option that you list, how will it work for multiple appenders? i.e. if two clients are writing to the end of the same block?

Tsz Wo (Nicholas), SZE added a comment - 17/Jun/08 08:32 PM
Hi Dhruba, you are right that (3) does not support concurrent appenders.

dhruba borthakur added a comment - 18/Jun/08 01:41 AM
So, do you concur that we can adopt approach 1?

Tsz Wo (Nicholas), SZE added a comment - 18/Jun/08 05:23 PM
If supporting concurrent appenders is one of the requirements, the only choice is (1).

dhruba borthakur added a comment - 25/Jul/08 06:16 PM
The code change related to this one was checked in as part of HADOOP-1700.

dhruba borthakur made changes - 25/Jul/08 06:16 PM
Resolution Won't Fix [ 2 ]
Fix Version/s 0.19.0 [ 12313211 ]
Status Open [ 1 ] Resolved [ 5 ]
Doug Cutting made changes - 16/Sep/08 05:22 PM
Fix Version/s 0.19.0 [ 12313211 ]
Doug Cutting made changes - 16/Sep/08 05:24 PM
Status Resolved [ 5 ] Closed [ 6 ]
Owen O'Malley made changes - 08/Jul/09 04:43 PM
Component/s dfs [ 12310710 ]