Thanks for waiting. I'm checking out the design doc.
In the proposed approach, truncate is performed only on a closed file. If the file is open for write, an
attempt to truncate fails.
Just a style change, but maybe "Truncate cannot be performed on a file which is currently open for writing" would be clearer.
Conceptually, truncate removes all full blocks of the file and then starts a recovery process for the
last block if it is not fully truncated. The truncate recovery is similar to the standard HDFS lease recovery
procedure. That is, the NameNode sends a DatanodeCommand to one of the DataNodes containing block
replicas. The primary DataNode synchronizes the new length among the replicas, and then confirms it to
the NameNode by sending a commitBlockSynchronization() message, which completes the
truncate. Until the truncate recovery is complete, the file is assigned a lease, which prevents
other clients from modifying the file.
I think a diagram might help here. The impression I'm getting is that we have some "truncation point" like this:
| A | B | C | D | E | F |
              ^ truncation point (inside block D)
In this case, blocks E and F would be invalidated by the NameNode, and block recovery would begin on block D?
"Conceptually, truncate removes all full blocks of the file" seems to suggest we're removing all blocks, so it might be nice to rewrite this as "Truncate removes all full blocks after the truncation point."
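To make sure I'm reading the proposal right, here's a sketch of the bookkeeping I have in mind. All names here are mine, not from the design doc: given the per-block lengths of a file and the target newLength, it works out which trailing blocks get invalidated outright and whether the cut falls inside a block that needs recovery.

```java
import java.util.ArrayList;
import java.util.List;

class TruncatePlan {
    final List<Integer> keptBlocks = new ArrayList<>();    // indices kept as-is
    final List<Integer> removedBlocks = new ArrayList<>(); // indices invalidated outright
    int recoveryBlock = -1;       // index of the block needing length adjustment, if any
    long recoveryNewLength = -1;  // bytes to keep within that block

    static TruncatePlan plan(long[] blockLengths, long newLength) {
        TruncatePlan p = new TruncatePlan();
        long offset = 0;  // byte offset where block i starts
        for (int i = 0; i < blockLengths.length; i++) {
            long end = offset + blockLengths[i];
            if (end <= newLength) {
                p.keptBlocks.add(i);           // block ends at or before the cut
            } else if (offset >= newLength) {
                p.removedBlocks.add(i);        // block starts at or past the cut
            } else {
                p.recoveryBlock = i;           // cut falls inside this block
                p.recoveryNewLength = newLength - offset;
            }
            offset = end;
        }
        return p;
    }
}
```

With six equal blocks A..F and a cut partway into D, this yields E and F removed and D marked for recovery, matching the diagram above.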
Full blocks, if any, are deleted instantaneously. And if there is nothing more to truncate, the NameNode returns success to the client.
They're invalidated instantly, but not deleted instantly, right? Clients may still be reading from them on the various datanodes.
public boolean truncate(Path src, long newLength)
Truncate file src to the specified newLength.
- true if the file has been truncated to the desired newLength and is immediately available to
be reused for write operations such as append, or
- false if a background process of adjusting the length of the last block has been started, and
clients should wait for it to complete before they can proceed with further file updates.
Hmm, do we really need the boolean here? It seems like the client could simply try to reopen the file until it no longer got a RecoveryInProgressException (or a lease exception, as the case may be). The client will have to do this anyway most of the time, since most truncates don't fall on even block boundaries.
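Here's the kind of retry loop I'm picturing, as a sketch. The class and exception names are placeholders, not real HDFS API: the client just retries the reopen until recovery finishes, with no boolean needed from truncate.

```java
class RecoveryInProgressException extends Exception {}

// Stand-in for whatever call reopens the file for append.
interface Reopener {
    void reopenForAppend() throws RecoveryInProgressException;
}

class TruncateRetry {
    // Retry the reopen until the last block's recovery completes, or give up.
    static boolean reopenWithRetry(Reopener r, int maxAttempts, long sleepMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                r.reopenForAppend();  // succeeds once recovery is done
                return true;
            } catch (RecoveryInProgressException e) {
                try {
                    Thread.sleep(sleepMillis);  // recovery still running; back off
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }
}
```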
It should be noted that applications that cache data may still see old bytes of the file stored
in the cache. Such applications are advised to incorporate techniques that retire cached data
when the file is truncated.
One issue that I see here is that DFSInputStream users will potentially continue to see the old, longer length for a long time. DFSInputStream#locatedBlocks will continue to have the block information it had prior to truncation. And eventually, whenever they try to read from that longer length, they'll get read failures since the blocks will actually be unlinked. These will look like IOExceptions to the user. I don't know if there's a good way around this problem with the design proposed here.
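To illustrate the staleness problem, here's a toy model. Nothing in it is real HDFS code: the reader snapshots the file length at open time, so once a truncate lands underneath it, reads past the new EOF surface as hard IOExceptions rather than a clean end-of-file.

```java
import java.io.IOException;

class ToyFile {
    long length = 768;                  // current length as the "NameNode" sees it
    void truncate(long newLength) { length = newLength; }
}

class ToyInputStream {
    private final ToyFile file;
    private final long cachedLength;    // captured at open, like DFSInputStream#locatedBlocks

    ToyInputStream(ToyFile f) { this.file = f; this.cachedLength = f.length; }

    // Returns 0 for a successful read, -1 for EOF per the reader's cached view.
    int read(long pos) throws IOException {
        if (pos >= cachedLength) return -1;  // clean EOF, per stale metadata
        if (pos >= file.length)              // block was unlinked by the truncate
            throw new IOException("block no longer exists at offset " + pos);
        return 0;
    }
}
```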
[truncate with snapshots]
I don't think we should commit anything to trunk until we figure out how this integrates with snapshots. It just impacts the design too much. When you start seriously thinking about snapshots, integrating this with block recovery (by adding BEING_TRUNCATED, etc.) does not look like a very good option. A better option would be simply to copy the partial block and have the snapshotted version reference the old block, and the new version reference the (shorter) copy. That corresponds to your approach #3, right? truncate is presumably a rare operation and doing the truncation in-place for non-snapshotted files is an optimization we could do later.
The copy approach is also nice for DFSInputStream, since readers can continue reading from the old (longer) copy until the readers close. If we truncated that copy directly, this would not work.
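A rough sketch of the copy-based alternative, with entirely hypothetical names: the snapshot keeps its reference to the original last block, while the live file points at a shorter copy, so nothing is ever truncated underneath an existing reader.

```java
import java.util.Arrays;

class CopyTruncate {
    static byte[] snapshotBlock;  // snapshot's reference to the old last block
    static byte[] liveBlock;      // live file's shorter copy

    // Copy-on-truncate: the old block stays intact for the snapshot (and for
    // any open readers); only the live file sees the shortened copy.
    static void truncateLastBlock(byte[] lastBlock, int newBlockLength) {
        snapshotBlock = lastBlock;
        liveBlock = Arrays.copyOf(lastBlock, newBlockLength);
    }
}
```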
We could commit this to a branch, but I think we should hold off on committing to trunk until we figure out the snapshot story.