|
I think the easiest way to implement appends (and other file modifications) is to keep blocks immutable, but permit one to alter the block list of a file. The simplest implementation of this would require the client, when appending, to re-read the entire last block of a file and then re-write a longer version it as a new block. This could perhaps be optimized by adding a datanode command that permitted one to start writing a new block on a datanode whose initial contents is that of another block, then append new data to that block. That datanode could chain the append to the other replicas.
I am not sure if this is simpler than allowing the last block to be expanded by datanodes. Is there is any advantage of making a block immutable (other than implementation)?
> Is there is any advantage of making a block immutable [ ... ]
Consider a datanode that's offline during an append and then comes online later. How do we discover that it's replica of a block is out of date? Block identity would have to become <id,length>. And that would only work for append, but not for other block modifications. > Consider a datanode that's offline during an append and then comes online later [ ... ]
<id, length> would work for both append and truncate, it wouldn't work for random writes into a block. A block could have a generation number that gets incremented every time the block is accessed for modification. The generation number would be recorded by the Namenode, which could discard old copies of blocks. Making a block immutable makes appends and truncates very heavyweight. You have to copy on average blocksize*2*num_replicas bytes to make any sort of modification.
> A block could have a generation number that gets incremented every time the block is accessed for modification.
But isn't <blockid+generation> really tantamount to a new block id? How is this any different from simply hard-linking to the old block file and then modifying it? > Making a block immutable makes appends and truncates very heavyweight. You have to copy on average blocksize*2*num_replicas bytes to make any sort of modification. The copies could be local on each datanode, not across the wire. They could be made to be from one drive to another, or in the case of append, you might hard link to the old block file if the client is sure to never read past block end. So append need not be heavy weight at all. If you're willing to possibly break existing readers while truncating, then you could always hard link block files, and never perform any copying. The file could be corrupted, e.g., if a truncate succeeds on the datanode but the block list on the namenode is not updated to reflect that, or vice versa, but I think the blockid+revision approach has exactly the same issue. Copying could prevent such issues, making modifications atomic, but at some cost to performance. > But isn't <blockid+generation> really tantamount to a new block id?
Not really, the implications for implementation are pretty different. If a new block id is to be used, the Namenode has to allocate a new block and delete the old block. Scheduling the old blocks replicas for deletion, dispatching the requests and journaling the new block is a non-trivial amount of Namenode activity. A revision number update can simply be recorded in memory. In the event of a conflict the Namenode would treat the highest revision numbered replicas as valid and discard out of date replicas. > Copying could prevent such issues [ ... ] Copying does make error handling somewhat easier. But it seems to me that it does so only when changes to a file are exposed in the Namenode at a block granularity. If we want to make changes visible at a finer grain both approaches have similar complexity in the corner cases of datanodes and writers crashing in the middle of updates. > A revision number update can simply be recorded in memory.
So the namenode wouldn't persist the block version number, it would just keep track of the highest revision that's yet been reported to it? If for any reason all of the replicas are not updated to the latest revision, the namenode would still need to issue replication and deletion commands, right? But I think you're arguing that in what's hopefully the common case, when all replicas are sucessfully modified, the namenode need only increment its in-memory block revision number and would not need to issue any other commands nor persist any data. That would indeed be a namenode performance advantage to this approach. One cost of this approach is that block revision numbers would consume more memory per block on the namenode, something we wish to minimize. So there is something of a time/space tradeoff. Some edge semantics would be different too. If only one replica is updated, and its datanode dies, and the namenode is restarted, then the change would silently be lost, no? With new block ids, this would result in a file with a missing block that could not be read. Thinking a bit more, it seems to me that with the new block-id approach, one could also manage things so that, when there are no failures, the namenode need not schedule deletions. Where in one approach the datanode would report the new block revision number, in the other approach it could report the old block id that was removed from that datanode. So the primary differences I now see between the approaches are just persistence and namenode memory size.
> One cost of this approach is that block revision numbers would consume more memory per block on the namenode, something we wish to minimize. So there is
> something of a time/space tradeoff. We're saving hundreds of bytes with But seriously, the space cost is very small, an additional 4-8 bytes/block. In all of the reduction in memory usage proposed in > Thinking a bit more, it seems to me that with the new block-id approach, one could also manage things so that [ ... ] That could be certainly be done. One issue with new block ids though is that if a datanode fails in the middle of an update, it would have both the old and new blocks, say <oldid> and <newid>. Replicas where the update succeeded would remove the old block and inform the Namenode, which would remove it from it's blocks map. The old block would remain on the failed datanode and be orphaned. If the Namenode then assigned this <oldid> to a new file, there would be a block-id collision. This problem already exists to some degree. If a file is deleted and a datanode containing that files blocks fails then these orphans remain on the failed node and a collision can happen when it comes back. If we use a revision number that takes the form of a timestamp, it can be used to distinguish not only out of date replicas of a currently existing file but also those from old long deleted files. Two birds and all that ... > If the Namenode then assigned this <oldid> to a new file, there would be a block-id collision.
As you observe, this also can happen if, e.g., a datanode is offline when a file is deleted, etc. So I don't think using new blockids to implement appends really makes this that much more likely. The primary advantage you've presented that versioning blocks has over allocating new blockids is that version numbers might not have to be persisted. But versions come at a cost: they take more memory; they have some worrisome edge conditions; and they introduce new concepts into a complex system, rather than simply building on existing, debugged concepts. None of these are fatal, but I don't yet see versioning as the clearly winning strategy. > If we use a revision number that takes the form of a timestamp, it can be used to distinguish not only out of date replicas of a currently existing file but also those from old long deleted files. I don't follow how this would work. Can you explain more? How would it help in the situation I described above, where only a single replica is updated, its datanode dies, and the namenode is restarted. The file should be corrupt, since the only up-to-date replica of one of its blocks is missing. How would you detect that? Would you declare any file corrupt whose last-modified time was later than that of any of its blocks? That seems fragile. > As you observe, this also can happen if, e.g., a datanode is offline when a file is deleted, etc. So I don't think using new blockids to implement appends really makes this
> that much more likely. You're right, it doesn't make it much more likely. But it is a problem that exists in the system today and something we likely need a solution for. I don't know the extent of the problem, because on our installations a node that's down for some time usually comes back wiped clean. However, in order to solve this problem we need some disambiguating marker on replicas to distinguish deleted replicas from those currently in use. A checksum would work if it was persisted on the Namenode, the Namenode would only accept replicas whose checksum matched what it had on record. A timestamp would work as well and wouldn't need to be persisted, the Namenode would treat the most recent replicas as valid. I'm suggesting that a revision number on a block that takes the form of a timestamp can resolve this issue as well be used to support appends. > How would it help in the situation I described above, where only a single replica is updated [ ... ] We need the right protocol here. If we have a 2-phase protocol where a writer first updates the revision number on all replicas before it starts writing then this issue doesn't arise. If one replica updates it's revision then dies, and the Namenode restarts, then the Namenode wouldn't see the replica on the dead node and accept the remaining 2 replicas as valid (assuming a replication of 3). In the event of a replica failing, the client would re-apply a revision change before writing, now using a different timestamp. The two remaining replicas would get updated. If the dead datanode then came back it's copy would be rejected. There are lots of corner cases here, but I believe they can be resolved such that we never silently lose data. Sameer, can you please succinctly state what you think the relative advantages and disadvantages of using new block ids versus blockids plus non-persistent version timestamps? I've stated what I think these are above, namely that new block ids use slightly less memory, increase the number of persisted events and simplify semantics and recovery. The last item is I believe the most significant.
Advantages of revisions:
Disadvantages of revisions:
I don't think that using new block ids simplifies the semantics and recovery. I should qualify that, new block ids don't simplify implementation and recovery unless the old blocks are copied on append/truncate. If hard links are used the semantics and recovery scenarios are equally complex since deltas will show up in both copies. There is a not insignificant performance downside if blocks are copied on append/truncate. > Unified mechanism for supporting appends and detection of stale replicas
I don't yet see how that's simplified by revisions. Doesn't that come for free with new block ids? The current rules don't need to change. A datanode doesn't report to the namenode that it has a block until it has a complete copy (i.e., the append is done). A file isn't closed until all blocks are sufficiently replicated (i.e., minimum replication of modified blocks is complete). If no replicas of a block are available, then the file is unreadable. There are no "stale" replicas to detect. > Doesn't that come for free with new block ids?
Yes, it does. But it doesn't help with detection of stale replicas from deleted files. I'm saying that revisions can be used to support append and solve the problem of detecting replicas from deleted files. Just wanted to pitch in some context...
Jim stated in the opening of this bug that a single client writing would be enough to address this issue. I agree. But what we should be clearer about is the ultimate desired semantics for readers. I'd define success as having a single client doing appends and flushes as desired (say per line in a log file) and having multiple clients "tail -f" the file and see updates at a reasonable rate, IE soon after each flush or every 64k bytes or so with less than a seconds latency. This would let us build systems that log directly into HDFS and have related systems respond based on those log streams. This is where I'd like to see us get with this issue. Clearly getting there involves getting a handle on all the stuff already discussed in this thread. We also need to think carefully about the pipelining and protocol issues involved in making this work. We might want to break the protocol change issues into another discussion, but I want to make sure we don't converge on solutions that will not work considering fine grained "flushes". > sameer: solve the problem of detecting replicas from deleted files
Is that actually a problem? We'll only generate such garbage blocks when nodes fail, and only a few then, so it doesn't happen at that high of a rate, and the current blockreport mechanism handles that. Block versioning won't let us get rid of the blockreport mechanism, so I still don't see that as an advantage of block versioning. Am I missing something? > eric: see updates at a reasonable rate, IE soon after each flush or every 64k bytes or so with less than a seconds latency That's a new requirement that I'd not heard before. The mechanism I'd proposed would not make appends visible until the file is closed, which is not what Eric's asking for. So this may make the case for block versioning. If we really must support what Eric's asking, then non-persistent timestamps on the blocks may be required.
To be clear, for the hbase scenario, appends need to be available to a reading process even if the file has not been closed (Otherwise updates can be lost). > so it doesn't happen at that high of a rate, and the current blockreport mechanism handles that
It doesn't, a blockreport only report block ids to the Namenode. If a Datanode goes down with garbage blocks, and some of those blockids are assigned to new files, and the Datanode comes back, the garbage blocks will be accepted as valid and cause filesystem corruption. We should try and prevent collisions that can cause corruption. A more complete statement of the HBase use case is:
One process is appending to a file. There is no requirement that the file be available to readers unless
For the latter case, we would be satisfied if a lease timeout on the file caused it to be closed. What happens (in HBase) if the process writing the file dies is that the HBase master server gets notified, and it then needs to open and read the file (it knows where the dead process put it). > If a Datanode goes down with garbage blocks, and some of those blockids are assigned to new files [ ... ]
How is this related to append? Block id collisions are improbable events that perhaps we should concern ourselves with more. It seems you're arguing that we should minimize block id allocation in order to address this, but that seems like a tiny bandaid and mostly irrelevant to the present discussion. The proposal I'd made was that new block ids created by appends should only be reported to the namenode when they're complete, and the namenode should not commit a file's new block id list until all block ids are reported complete, making appends atomic. Such an approach would inhibit rapid visibility of appends, which may be required, and that is a vote against it. So we may have to go with something like non-persistent block timestamps, but I still don't see your point about non-persistent timestamps substantially reducing corruption. Rather they seem to create new opportunities for corruption that may require new mechanisms to prevent such corruptions, and that may just be the price we have to pay. > How is this related to append?
The mechanism used to detect these collisions i.e. non-persistent timestamps can be trivially leveraged to support appends. > So we may have to go with something like non-persistent block timestamps, but I still don't see your point about non-persistent timestamps substantially reducing corruption. They enable detection of stale replicas thereby eliminating that source of corruption. > Rather they seem to create new opportunities for corruption I don't see that. Could you explain? Ah. I think you're intending that datanodes do persist the block timestamp, but the namenode does not? If so, I'd missed that. I thought you'd said that timestamps were not persisted. But if datanodes persist them then that could indeed help detect block id collisions, since a timestamp collision would nearly impossible (assuming clocks are reasonably synchronized and accurate). So would DFSClient send the timestamp with each flushed buffer, and the datanode log it to a log that's replayed on startup?
As for new opportunities for corruption, I simply meant that having multiple versions of a block increases the chances of getting the wrong version. The namenode and datanode will have substantial new logic to handle block versioning, and more logic increases the chances of faulty logic, introduces new failure modes, etc. The proposal I made required far fewer fundamental changes to block semantics, and thus mostly builds on already debugged logic. Changing blocks from immutable to mutable will require us to uncover all the places where we've assumed immutability. Some of these may not be obvious. That's all I meant. Just merrily spreading a little FUD about change of any sort! Why would the name node not keep the timestamp/version? Seems to me it will need it to disambiguate new block reports and such. We can afford the bytes IMO.
Those statements about "reasonably synchronized clocks" worry me. Especially if clients or datanodes are producing them. If the name node is not the source of truth, I'm afraid we can introduce weird cases. I think the next logical step here is for us to outline a more detail proposal for review. I think this is going to take a couple of distinct steps. We can outline those and get to agreement on the basics. Jim's request that a file exist if its producer dies before a close sounds like it would be easy, but then we get to flush semantics... I take it you don't just want the last complete 128M block Jim? If you want the last "log line" then you need something similar to what I outlined. > I think you're intending that datanodes do persist the block timestamp
Yes, that is what I was intending. Sorry, if that wasn't clearer in my description. Timestamps would be persisted on the Datanodes but not on the Namenode. Having the DFSClient generate the timestamps would open us up to some issues with unsynchronized clocks, the probability of error is very low though. Having the Namenode generate the timestamps would be more robust, but it might mean more Namenode transactions depending on how frequently a timestamp is requested. It still exposes us to problems if the clock on the Namenode machine is turned back for some reason. Most robust would be the Namenode generating monotonically increasing revisions. The DFSClient could send the timestamp with each flushed buffer, but it's not clear that we need the timestamps to be so fine grained. A new timestamp could be generated every time a block is accessed for modification (and when an error occurs during an append). This would require no change in the number of Namenode transactions, but the corner cases when a client dies in the middle of a write could be trickier to handle. > As for new opportunities for corruption, I simply meant that having multiple versions of a block increases the chances of getting the wrong version Fair point. Block revisions will mean new code that needs extensive testing and debugging and will likely expose hidden assumptions in the code. We should get a lot of eyes on code that we write for this issue. Given the number of watchers on this issue, it shouldn't be a problem > sameer: [ ... ] but the corner cases when a client dies in the middle of a write could be trickier to handle.
Yes, that worries me. I'd be more comfortable sending a new version per flush, so that we have some atomic ground to stand on. Perhaps we can address clock skew by, as you suggest, using revisions rather than timestamps. The namenode need not be involved in the allocation of each revision, only in initializing the revision counter for a modification sequence. After that, the DFSClient can increment the revision for each buffer. Do we lose anything by not using timestamps? We could avoid collisions of blockid+revision with the filesystem id, no? > eric: the next logical step here is for us to outline a more detail proposal Yep. That'd be good. +1 It just occurred to me that we already have a number that monotonically increases as we append to a block: its length. So long as we don't intend to support file modifications other than appends, this could work as the revision. It would have the advantage that, if a node has a stale copy of a block, it could incrementally update the block. Truncation might be tricky with this approach, but perhaps the current version of the final block could be, rather than the longest version, the version that matches the length of the file. Just a thought.
eric baldeschwieler - 06/Sep/07 01:29 AM
> Jim's request that a file exist if its producer dies before a close sounds like it would be easy, but then we > get to flush semantics... I take it you don't just want the last complete 128M block Jim? If you want the > last "log line" then you need something similar to what I outlined. Yes, we want to get back the last record we wrote. (unless people think a lossy database is ok Typically we write multiple records at a time. It would not be a big deal for us to follow a group of writes with a flush, since if we can't write the whole group, we'd rather not get a partial group back. So this means that we don't need every write to be atomic, just every flush. (Think database transaction with multiple updates: either they all succeed or they all fail) Yes, I'd thought of using length too. We have had requests to support truncate as well. I'm on the fence on that one. It is simple and more provably correct if we don't do truncates and just track block length.
But we've a significant client group that has had an interest in truncates. We'll do some chewing on that one and see if we want to propose one project that does truncates and appends (less disruption to get both features) or appends only and presumably defer our ambition to do truncates for at least a year. The combination of very high block replication counts and network partitions still feels like it might produce some kind of nasty cases here.
Here is a slightly more detailed description of a proposal to support appending writes to files.
There is an DataGenerationStamp associated with every block. It is persisted by the namenode and by the datanode(s).
*The Writer*
1. The client requests the namenode for a new block. The namenode generates a new blockid and associates a DataGenerationStamp
of 0 with this block. It persists this blockId in the inode and the DataGenerationStamp in the BlocksMap. The Namenode
returns the blockid, DataGenerationStamp and block locations to the Client.
2. The Client sends the blockid and DataGenerationStamp to all the datanodes in the pipeline. The Datanodes record the blockid and the
DataGenerationStamp persistently and returns. In case of error, go to Step 3a.
3. The Client then starts streaming data to the Datanodes in the pipeline.
The Client notices if any datanodes in the pipeline encountered an error. In this case:
3a. The Client removes the bad datanode from the pipeline.
3b. The Client requests a new DataGenerationStamp for this block from the NameNode. The Client also informs the Namenode
of the bad Datanode.
3c. The Namenode removes the bad datanode as a valid block location for this block. The Namenode increments the current
DataGenerationStamp by one, persists it, and returns it to the Client.
3d. The Client sends the new DataGenerationStamp to all remaining datanodes in the pipeline.
3e. The Datanodes receive the new DataGenerationStamp and persist it.
3f. The Client can now continue, go back to Step 3 above.
4. The Datanode sends block confirmations to the namenode when the full block is received. The block confirmation has the
blockid and DataGenerationStamp in it.
5. The Namenode receives a block confirmation from a Datanode. If the DataGenerationStamp does not match with what is stored in
the inode, the namenode refuses to consider that Datanode as a valid replica location. The namenode sends a block delete
command to that Datanode.
*Reader (concurrent reading while file is being appended to)*
1. A reader that opens a file gets the list of blocks from the Namenode. Each block has the block locations
and DataGenerationStamp too.
2. A client sends the DataGenerationStamp along with every read request to a datanode. The datanode refuses the serve the
data if the DataGenerationStamp does not match with the value in its persistent store. In this case, the client will fail
over to other datanodes.
This algorithm came out of a discussion with Sameer. This solution does not solve the problem of duplicate blockids
that can result when datanodes that were down for a long time re-appear.
When is the block data transmitted? Does a datanode not persist its stamp for a block until after it has persisted the data corresponding to that stamp? That way, if it crashes before it recieves all of the data, it has an out-of-date stamp. Is that right?
What happens when a block is partially modified and the client crashes? Block data is transmitted in Step 3 above. The datanode(s) persist the DataGenerationStamp as soon as it receives it. Then it starts receiving the data. If it crashes before it received all the data, the Client detects this condition and increments the DataGenerationStamp on the namenode and on the remaining good datanodes.
Let's consider the case when the client crashes before the current write (or flush) is successfully transmitted to all datanodes. In this case it is possible that the Datanodes have different sizes of this block. In this case, the lease expires on the Namenode and it is the namenode's duty to do the recovery that the client would have otherwise done. The namenode fetches from the Datanode the size of each replica of the block-under-modification and selects the largest size block as valid. It increments the DataGenerationStamp and sends this new stamp to the datanodes that have the largest size replica. What happens if the namenode crashes before it could complete the entire lease-timeout-triggered-recovery (described above) for blocks that were being modified? I am thinking about this one. This sounds hairy to me, with lots of edge conditions. And we have not yet specified how truncate will work. If the point of using a DataGenerationStamp instead of simply block length is to support truncates too, then we should design and implement truncate too, else how do we know that the design really is sufficient for truncate?
At this point I'm leaning towards only supporting append and using block length. The semantics are much simpler. Eric? Is there a description of what happens when a client wants to append data to a file?
I think we can get away without supporting truncate in the medium term. Less is more...
That said, I don't have a strong opinion on generation numbers vs length. I'd like to be sure that we are able to detect the stale block issue. Maybe a unique file ID? Does this proposal try to finish with fewer datanodes than requested or does it request new ones? Generation numbers would be needed even if we don't support truncate. Using length is not reliable, there are a few corner cases where data can be silently lost. In addition, the proposal described above isn't completely consistent with our discussions. We'll need to publish something a lot more detailed with explanations of how corner cases are handled.
The above proposal says that if a datanode in the pipeline dies, the client detects this condition but does not request a new datanode as a replacement. If this does not give us an acceptable level of reliability,we might have to design something better.
Here is a more detailed description of the proposal to support appends in HDFS. This is just the first draft and I will be expanding more on it next week. Please review and let me know your feedback.
A Word document that has the same contents as the html design document
That html was pretty unreadable, at least in firefox. I re-generated from the .doc using Open Office.
"The Client notices if any Datanodes in the pipeline encountered an error [... ]".
We need a bit more discussion of how this will work. The typical error will not be a nice message sent back up the pipeline, but rather will be a timeout, breaking the pipeline. We must be careful so that timeouts do not cascade, perhaps by decreasing the timeout for each stage added to the pipeline. Removing a node from the pipeline then means building a new pipeline from at least the node before the fracture onwards. Is this the sort of mechanism you have in mind? This version has description about how appends will work with BlockRebalancer and Datanode Periodic verification. It also has details on how a client-flush call will handle data that is smaller than a crc chunk.
Comments on the document.
In the Requirements section: > There can be many simultaneous writers reading the same file at the same time. Do you mean readers here instead of writers? As far as I understand the internals of the DFS, this appears to satisfy our requirements. A process is writing to a file and has done flushes. Another process detects that the writer is now dead and reads the file (up to the point it was last flushed) and then deletes the file (because it knows the writer is dead). This happens well before the file lease timeout. My understanding is that the file will be deleted and that if another process tries to create a file with the same name, this should succeed and not contain any of the contents of the original file. Is this correct? If so, ++1 One more comment. As I am not entirely familiar with the internals of DFS, I'd like to see a +1 from Doug (Cutting) and/or Konstantin.
Thanks for you comments. Your understanding is absolutely correct. I made the user-case explicit in the section titled "File Deletions and Renames". Thanks.
An html version of this design document.
I think this design looks well-enough thought through that we should probably start trying to implement it.
This design document is great to have! Can you please convert it to forrest XML and add it to a reference section of the documentation tree along with the patch for this issue? Incorporated review comments from Rob Chansler. Thanks Rob.
Hi, everyone
We are making our best effort to improve Hadoop Distributed Our implementation will follow the desigh of Hadoop-1700. In the 0. We recommend: the BlockID don't adopt random 64bit ID. We give a 1. In section The Writer of Hadoop-1700, the original text is: ____________________________________________________________________________________________________________________________ When client successfully updates the all datanodes where the replicas
2. In hadoop-1700, OpenFile transaction log records all the blocklist. We look forward to getting help from Hadoop Community. Your any advice Thanks Ruyue for your comments.
0. Regarding your comment about using a sequential block-id, instead of a random block-id: How do we uprade existing clusters? An existing cluster can have huge number of already allocated blocks. 1. Your comment about Point 1 makes sense. i will update the document. 2. Regarding Point 2: a file typically has few blocks, ranging from somewhere between 2 to 10 blocks. I like your proposal to optimize transaction logging, especially when the number of blocks in a file are huge. Is it possible to consider it as an enhancement and implement it after implementing HDFS Apends? Won't the system be simpler if we avoid this optimization at first go and then once HDFS Append is committed, then we can work on this optimization? this patch gives a simple append implementation based on 0.15.1 version. it is completely compatible with the 0.15.1. It adds a new open() function in DFSClient class.
Usage: 1. open for read InputStream is = (FSDataInputStream)dfsclient.open(path, buffersize, "r"); 2. open for append OutputStream os = (FSDataOutputStream)dfsclient.open(path, buffersize, "a"); 3. open for append (if no file, create it) OutputStream os = (FSDataOutputStream)dfsclient.open(path, buffersize, "a+c"); this patch gives a simple append implementation based on 0.15.1 version. it is completely compatible with the 0.15.1. It adds a new open() function in DFSClient class.
Usage: 1. open for read InputStream is = (FSDataInputStream)dfsclient.open(path, buffersize, "r"); 2. open for append OutputStream os = (FSDataOutputStream)dfsclient.open(path, buffersize, "a"); 3. open for append (if no file, create it) OutputStream os = (FSDataOutputStream)dfsclient.open(path, buffersize, "a+c"); Ruyue:
I know little of hdfs internals but here are a few comments on the patch. Looks like you made your patch with diff going between a java-old and a java-new directory. Can you makie it instead by doing a 'svn diff' inside the top directory of an hadoop checkout? You should also make the patch apply to TRUNK. Currently it does not. Its unlikely that your patch will be added to the 0.15.x branch since usually only bug fixes are made against released versions and as is, it makes it hard for me to take your patch for a test drive. Tabs should be purged and replaced by spaces to make your patch palatable. New classes could do with javadoc explaining their purpose: e.g. BasicAppendInfo class comment could explain where its used and the payload it carries. New methods added to the protocol could also do with javadoc: e.g. baseAppendFile. Also, I'd guess that as is, your patch will emit javadoc warnings which will make it so your patch will fail the patch-build (your createFile method does not detail the method parameters). You also have an instance of empty javadoc. Your getting of blockid is synchronized but not on setting. Hi Ruyue,
I like all of the preceding review comments by stack. I have two basic questions. 1. If a client has opened a file for "append" and is writing to the end of the file. The last block of the file has three replicas. It can so happen that the system fails to write the data to some replicas. What is your error recovery policy? All replicas would be of different size, can you pl point me to the code that handles this inconsistency? I cannot find this code in your patch. 2. The DataNode, while finalizing a block, sends the block-size to the NameNode. In the case of "appends", the block size of the block changes with every write. Can you pl point me to the code that makesthe necessary changes to the Namenode's data structure that reflects the new-size after the append is finished? I could not find this code in your patch. thanks, What's the status on this patch? Is there anything that can be done to move this along? This is long-awaited functionality on many fronts. There was also some mention by Ruyue of a more complete patch coming soon. Would be willing to help out in any way possible.
Hello Chad, I am currently working on
I am canceling this patch because I am not convinced that the current patch does error recovery.
Most of the JIRA linked to this one are fixed or have patches associated with them. That means mos tof the pre-resuisites for supporting appends are already done. i am currently working on a distributed upgrade to upgrade a current cluster to the new disk format (
Once the above pre-requisites are done, the new code for supporting Appends will be started. I am guessing it will be another 1.5 man month before it appears as a draft patch. Should we add another blocking issue to extend the FileSystem API to support append? This might just be adding a new method, 'FSDataOutputStream append(Path)' that throws UnsupportedOperationException by default. We could implement it in the local filesystem then too.
This might be useful to applications (like Creating a new JIRA to decide on the Append API sounds great. I will create a JIRA and link it with H-1700.
HDFS's cache of FileStatus in each Path (
Dhruba, need some guidance here. Do you intend to complete work on append? Are there any append related patches ready for review or commit that you need help with?
Attached a record of some of the discussions of how to begin using sequential block IDs.
Grid_HadoopRenumberBlocks.pdf Hi Sameer, I would like to complete the "append" work as described in the design paper associated with this JIRA. To achieve this, I would like to get
Then, Nicholas's > To achieve this, I would like to get
I think > Hi Sameer, I would like to complete the "append" work as described in the design paper associated with this JIRA
Awesome! Thanks Dhruba. Nicholas will work with you to take this forward. The current generation stamp implementation should be committed. I recorded the discussion of serial block IDs for the curious, but I don't propose changing directions.
Append is almost ready.
1700_20080606.patch: implemented DistributedFileSystem.append(...). TODO:
The patch looks good. This patch introduces the changes to the ClientProtocol. More changes will be required for the data transfer protocol between the client and datanode to implement the full functionality. I documented one of the major remaining issues in
Till we develop a patch for So
For "trigger lease recovery when a lease already exists and is expired", I think NameNode could throw an exception for Lease expired soft limit, then the client can go ahead to trigger the lease recovery. The patch also needs some tests. Finally, the last patch for supporting "append" on HDFS. This patch supports a single simultaneous appender to a file.
I purposely left in a few debug prints for now. This version of the patch is now available for review. This is the final patch for supporting appends. It has an unit test that randomly triggers random-size appends to file of random size with random replication factor.
appendtrunk6.patch does not apply anymore. Could you update it, Dhruba?
Merged patch with latest trunk.
Not able to review the whole patch. Below are something I found so far. Updated to appendtrunk8.patch.
Major changes:
Minor changes:
appendtrunk9.patch: added two more tests for testing appending to an non-existing file and permission checking
However, TestFileAppend2 currently is failed, even for trunk + appendtrunk7.patch. Could you check it, Dhruba? Thanks for the tests and associated code-changes Nicholas. I will look into why the unit test failed on your setup.
appendtrunk10.patch: fixed a bug. Passed TestFileAppend2 on Linux but there is a rename problem in Windows.
appendtrunk11.patch: fixed the rename problem on Windows. The only test it fails is TestDatanodeDeath. It also fails on trunk + appendtrunk7.patch
appendtrunk12.patch: removed some debug codes
This fixes the test failure with TestDatanodeDeath.
Added TestFileAppend2 unit test.
Fixed javac and javadoc warnings. This probably has two Findbugs warnings still to be fixed.
Incorporated most review comments. Let's try HadoopQA tests.
Fixed findbugs warnings. Many thanks to Nicholas for helping me run findbugs.
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12386249/appendtrunk14.patch against trunk revision 677470. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2888/testReport/ This message is automatically generated. Ok, got +1 from hadoopQA. I am planning to check this into trunk by the end of this week!
This is a large patch, and we don't have a test plan or any evidence that the feature was thoroughly tested.
The unit test includes two unit tests
What really need to be tested is different failure scenarios, like when the client or one of data-nodes in the pipeline fail Briefly looked at the code, noticed that you forgot to increment the ClientDatanodeProtocol.versionID although the comment is there. This is a substantial new feature, and 0.18 is still waiting on the fixes to previous changes needed to facilitate append.
Konstantin's previous comment has some useful detail about necessary unit tests. Also there is a necessity for at-scale functional testing. Has there been any functional testing from application programs? At challenging load levels? In competition with other activity? Has performance testing established that append changes do not have an adverse impact on other operations? Does the design documentation need to be updated since November? What about user documentation? As a new feature, the testing requirements are somewhat lower, no? Existing tests should test existing functionality, so we should have some confidence that this won't break existing applications. If new applications that use the new feature are broken then that's less of a problem. So we ought to perhaps do our normal stress-testing (e.g., sort900) before committing this, to ensure that existing features still work well.
Things are actually a bit more complex, since, e.g., a new application that uses this new feature might crash a namenode that's shared with an existing application. One way to address this concern might be to permit the new feature (appends) to be disabled. That said, I think more thorough tests would be good to have, I just question whether they need to hold up committing this. Getting features into trunk earlier, well before a freeze, is also a good way to find problems, problems that might not be found with synthetic tests. So, my concrete proposal: Before we commit this we should,
Does that sound reasonable? Dhruba, wasn't there a test plan for this at one point? I can't seem to find it.
Doug's comments seem reasonable if someone is going to sign up to fix those blockers. In addition to Doug's list, I'd add that a test plan must be published for this feature. I agree with Doug. Also, this patch is pretty light-weight compared to earlier append-related patches. Most of the heavy-lifting has been done in earlier patches that are part of 0.18. This patch does not change any disk format(s).
Making a configurable would have been an option, I will do it as part of this JIRA. Regarding test plan, there is a JIRA opened in Jan 08: http://issues.apache.org/jira/browse/HADOOP-2658 Opened new JIRA to add more append-related tests. http://issues.apache.org/jira/browse/HADOOP-3790 Introduce a configuration option to support/disable hdfs appends.
The answer to Doug's question is, "Not necessarily." But, I'm in general agreement with his proposal with the understanding that the new blockers are assigned in advance.
If dfs.support.append is set to false, then the "append" feature to hdfs will be switched off.
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12386453/appendtrunk15.patch against trunk revision 678080. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2913/testReport/ This message is automatically generated. Hi Nigel/Mukund,
I understand that large-scale performance testing and a test-plan are two things that are keeping this issue from being checked in. Would appreciate it a lot if any of you can help in this regard. thanks in advance, dhruba As far as I understood the latest discussion the prerequisites for this patch is the Doug's proposal plus the test plan.
I'll repeat them here.
The biggest issues are the large scale test and the test plan. > Develop and implement test plan (
I thought we could commit 1700 before we implement all the tests, provided the tests were a 0.19 blocker and folks were committed to implement them ASAP. +1
Dhruba, if you can author a test plan in Ok, i will author a test plan. Will post the test plan to
I had an offline discussion with Nicholas, Sameer, Rob Chansler and Nigel. We agreed that we will open a couple od JIRAs (marked as blockers for 0.19), these will be related to documentation, writing more unit tests, creating a append-benchmark and running large-scale tests to validate existing behaviour and performance. Rob has volunteered to create these JIRAs. Once these are created, we should check in this patch.
This means that it is likely that this patch gets checked into trunk in a day or two. +1 appendtrunk15.patch looks good
ClientDatanodeProtocol version is still 1. Should be 2.
Incorporated Konstantin's review comments.
Rerun HadoopQA one final time!
Rerun HadoopQA one final time!
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12386795/appendtrunk16.patch against trunk revision 679601. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2944/testReport/ This message is automatically generated. Appends to HDFS is finally committed! Many thanks to many many people who contributed to the design, development and testing. Special thanks to Nicholas who developed some key pieces related to this patch!
Integrated in Hadoop-trunk #581 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/581/
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Additionally, since we would do a batch of appends at one time, we would be happy if the appends did not become visible until we called flush() or something like that. We certainly do not have a requirement that every append become visible immediately.