Issue Details (XML | Word | Printable)

Key: HADOOP-1700
Type: New Feature New Feature
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: dhruba borthakur
Reporter: stack
Votes: 11
Watchers: 27
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

Append to files in HDFS

Created: 09/Aug/07 06:45 PM   Updated: 08/Jul/09 04:42 PM
Component/s: None
Affects Version/s: 0.15.1
Fix Version/s: 0.19.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works 1700_20080606.patch 2008-06-06 10:10 PM Tsz Wo (Nicholas), SZE 14 kB
Text File Licensed for inclusion in ASF works append.patch 2007-12-26 02:49 AM Ruyue Ma 22 kB
Text File Licensed for inclusion in ASF works append3.patch 2008-06-29 10:49 AM dhruba borthakur 55 kB
Microsoft Word Licensed for inclusion in ASF works Appends.doc 2007-11-29 05:53 PM dhruba borthakur 76 kB
Microsoft Word Licensed for inclusion in ASF works Appends.doc 2007-11-16 06:39 AM dhruba borthakur 70 kB
HTML File Licensed for inclusion in ASF works Appends.html 2007-11-16 06:40 AM dhruba borthakur 45 kB
Text File Licensed for inclusion in ASF works appendtrunk10.patch 2008-07-15 12:56 AM Tsz Wo (Nicholas), SZE 81 kB
Text File Licensed for inclusion in ASF works appendtrunk11.patch 2008-07-15 06:21 PM Tsz Wo (Nicholas), SZE 80 kB
Text File Licensed for inclusion in ASF works appendtrunk12.patch 2008-07-15 07:03 PM Tsz Wo (Nicholas), SZE 80 kB
Text File Licensed for inclusion in ASF works appendtrunk13.patch 2008-07-16 09:44 PM dhruba borthakur 79 kB
Text File Licensed for inclusion in ASF works appendtrunk13.patch 2008-07-16 08:15 PM dhruba borthakur 79 kB
Text File Licensed for inclusion in ASF works appendtrunk13.patch 2008-07-16 01:39 AM dhruba borthakur 64 kB
Text File Licensed for inclusion in ASF works appendtrunk14.patch 2008-07-17 01:02 AM dhruba borthakur 79 kB
Text File Licensed for inclusion in ASF works appendtrunk14.patch 2008-07-17 12:40 AM dhruba borthakur 80 kB
Text File Licensed for inclusion in ASF works appendtrunk15.patch 2008-07-19 01:25 AM dhruba borthakur 80 kB
Text File Licensed for inclusion in ASF works appendtrunk16.patch 2008-07-24 01:24 PM dhruba borthakur 80 kB
Text File Licensed for inclusion in ASF works appendtrunk6.patch 2008-07-07 10:58 PM dhruba borthakur 73 kB
Text File Licensed for inclusion in ASF works appendtrunk7.patch 2008-07-08 10:36 PM dhruba borthakur 73 kB
Text File Licensed for inclusion in ASF works appendtrunk8.patch 2008-07-10 02:43 AM Tsz Wo (Nicholas), SZE 76 kB
Text File Licensed for inclusion in ASF works appendtrunk9.patch 2008-07-14 09:09 PM Tsz Wo (Nicholas), SZE 79 kB
PDF File Licensed for inclusion in ASF works Grid_HadoopRenumberBlocks.pdf 2008-05-07 10:53 PM Robert Chansler 74 kB
Issue Links:
Blocker
 
Dependants
 
Reference
dependent
 

Hadoop Flags: Reviewed, Incompatible change
Release Note:
Introduced append operation for HDFS files.
Resolution Date: 25/Jul/08 06:09 PM


 Description  « Hide
Request for being able to append to files in HDFS has been raised a couple of times on the list of late. For one example, see http://www.nabble.com/HDFS%2C-appending-writes-status-tf3848237.html#a10916193. Other mail describes folks' workarounds because this feature is lacking: e.g. http://www.nabble.com/Loading-data-into-HDFS-tf4200003.html#a12039480 (Later on this thread, Jim Kellerman re-raises the HBase need of this feature). HADOOP-337 'DFS files should be appendable' makes mention of file append but it was opened early in the life of HDFS when the focus was more on implementing the basics rather than adding new features. Interest fizzled. Because HADOOP-337 is also a bit of a grab-bag – it includes truncation and being able to concurrently read/write – rather than try and breathe new life into HADOOP-337, instead, here is a new issue focused on file append. Ultimately, being able to do as the google GFS paper describes – having multiple concurrent clients making 'Atomic Record Append' to a single file would be sweet but at least for a first cut at this feature, IMO, a single client appending to a single HDFS file letting the application manage the access would be sufficent.

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Jim Kellerman added a comment - 10/Aug/07 05:04 PM
For HBase, we only need a single appender.

Additionally, since we would do a batch of appends at one time, we would be happy if the appends did not become visible until we called flush() or something like that. We certainly do not have a requirement that every append become visible immediately.


Doug Cutting added a comment - 30/Aug/07 05:17 PM
I think the easiest way to implement appends (and other file modifications) is to keep blocks immutable, but permit one to alter the block list of a file. The simplest implementation of this would require the client, when appending, to re-read the entire last block of a file and then re-write a longer version it as a new block. This could perhaps be optimized by adding a datanode command that permitted one to start writing a new block on a datanode whose initial contents is that of another block, then append new data to that block. That datanode could chain the append to the other replicas.

Raghu Angadi added a comment - 30/Aug/07 05:47 PM
I am not sure if this is simpler than allowing the last block to be expanded by datanodes. Is there is any advantage of making a block immutable (other than implementation)?

Doug Cutting added a comment - 30/Aug/07 06:35 PM
> Is there is any advantage of making a block immutable [ ... ]

Consider a datanode that's offline during an append and then comes online later. How do we discover that it's replica of a block is out of date? Block identity would have to become <id,length>. And that would only work for append, but not for other block modifications.


Sameer Paranjpye added a comment - 30/Aug/07 06:45 PM
> Consider a datanode that's offline during an append and then comes online later [ ... ]

<id, length> would work for both append and truncate, it wouldn't work for random writes into a block. A block could have a generation number that gets incremented every time the block is accessed for modification. The generation number would be recorded by the Namenode, which could discard old copies of blocks.


Sameer Paranjpye added a comment - 30/Aug/07 06:49 PM
Making a block immutable makes appends and truncates very heavyweight. You have to copy on average blocksize*2*num_replicas bytes to make any sort of modification.

Doug Cutting added a comment - 30/Aug/07 07:24 PM
> A block could have a generation number that gets incremented every time the block is accessed for modification.

But isn't <blockid+generation> really tantamount to a new block id? How is this any different from simply hard-linking to the old block file and then modifying it?

> Making a block immutable makes appends and truncates very heavyweight. You have to copy on average blocksize*2*num_replicas bytes to make any sort of modification.

The copies could be local on each datanode, not across the wire. They could be made to be from one drive to another, or in the case of append, you might hard link to the old block file if the client is sure to never read past block end. So append need not be heavy weight at all.

If you're willing to possibly break existing readers while truncating, then you could always hard link block files, and never perform any copying. The file could be corrupted, e.g., if a truncate succeeds on the datanode but the block list on the namenode is not updated to reflect that, or vice versa, but I think the blockid+revision approach has exactly the same issue. Copying could prevent such issues, making modifications atomic, but at some cost to performance.


Sameer Paranjpye added a comment - 31/Aug/07 03:27 AM
> But isn't <blockid+generation> really tantamount to a new block id?

Not really, the implications for implementation are pretty different. If a new block id is to be used, the Namenode has to allocate a new block and delete the old block. Scheduling the old blocks replicas for deletion, dispatching the requests and journaling the new block is a non-trivial amount of Namenode activity. A revision number update can simply be recorded in memory. In the event of a conflict the Namenode would treat the highest revision numbered replicas as valid and discard out of date replicas.

> Copying could prevent such issues [ ... ]

Copying does make error handling somewhat easier. But it seems to me that it does so only when changes to a file are exposed in the Namenode at a block granularity. If we want to make changes visible at a finer grain both approaches have similar complexity in the corner cases of datanodes and writers crashing in the middle of updates.


Doug Cutting added a comment - 31/Aug/07 03:47 AM
> A revision number update can simply be recorded in memory.

So the namenode wouldn't persist the block version number, it would just keep track of the highest revision that's yet been reported to it? If for any reason all of the replicas are not updated to the latest revision, the namenode would still need to issue replication and deletion commands, right? But I think you're arguing that in what's hopefully the common case, when all replicas are sucessfully modified, the namenode need only increment its in-memory block revision number and would not need to issue any other commands nor persist any data. That would indeed be a namenode performance advantage to this approach.

One cost of this approach is that block revision numbers would consume more memory per block on the namenode, something we wish to minimize. So there is something of a time/space tradeoff.

Some edge semantics would be different too. If only one replica is updated, and its datanode dies, and the namenode is restarted, then the change would silently be lost, no? With new block ids, this would result in a file with a missing block that could not be read.


Doug Cutting added a comment - 31/Aug/07 04:58 AM
Thinking a bit more, it seems to me that with the new block-id approach, one could also manage things so that, when there are no failures, the namenode need not schedule deletions. Where in one approach the datanode would report the new block revision number, in the other approach it could report the old block id that was removed from that datanode. So the primary differences I now see between the approaches are just persistence and namenode memory size.

Sameer Paranjpye added a comment - 31/Aug/07 06:09 AM
> One cost of this approach is that block revision numbers would consume more memory per block on the namenode, something we wish to minimize. So there is
> something of a time/space tradeoff.

We're saving hundreds of bytes with HADOOP-1687, what's the use of doing that if we don't use them

But seriously, the space cost is very small, an additional 4-8 bytes/block. In all of the reduction in memory usage proposed in HADOOP-1687, there has been very little culling of file and block metadata. Most of the savings come from the use of better containers, sorted lists vs treemaps, for instance and from limiting the total number of Java objects the Namenode has in memory. My point is that the data members of file and block objects have not been the source of our memory problems and are unlikely to be so even if we add a few dozen bytes to them.

> Thinking a bit more, it seems to me that with the new block-id approach, one could also manage things so that [ ... ]

That could be certainly be done.

One issue with new block ids though is that if a datanode fails in the middle of an update, it would have both the old and new blocks, say <oldid> and <newid>. Replicas where the update succeeded would remove the old block and inform the Namenode, which would remove it from it's blocks map. The old block would remain on the failed datanode and be orphaned. If the Namenode then assigned this <oldid> to a new file, there would be a block-id collision. This problem already exists to some degree. If a file is deleted and a datanode containing that files blocks fails then these orphans remain on the failed node and a collision can happen when it comes back.

If we use a revision number that takes the form of a timestamp, it can be used to distinguish not only out of date replicas of a currently existing file but also those from old long deleted files. Two birds and all that ...


Doug Cutting added a comment - 31/Aug/07 05:37 PM
> If the Namenode then assigned this <oldid> to a new file, there would be a block-id collision.

As you observe, this also can happen if, e.g., a datanode is offline when a file is deleted, etc. So I don't think using new blockids to implement appends really makes this that much more likely.

The primary advantage you've presented that versioning blocks has over allocating new blockids is that version numbers might not have to be persisted. But versions come at a cost: they take more memory; they have some worrisome edge conditions; and they introduce new concepts into a complex system, rather than simply building on existing, debugged concepts. None of these are fatal, but I don't yet see versioning as the clearly winning strategy.

> If we use a revision number that takes the form of a timestamp, it can be used to distinguish not only out of date replicas of a currently existing file but also those from old long deleted files.

I don't follow how this would work. Can you explain more? How would it help in the situation I described above, where only a single replica is updated, its datanode dies, and the namenode is restarted. The file should be corrupt, since the only up-to-date replica of one of its blocks is missing. How would you detect that? Would you declare any file corrupt whose last-modified time was later than that of any of its blocks? That seems fragile.


Sameer Paranjpye added a comment - 31/Aug/07 09:51 PM
> As you observe, this also can happen if, e.g., a datanode is offline when a file is deleted, etc. So I don't think using new blockids to implement appends really makes this
> that much more likely.

You're right, it doesn't make it much more likely. But it is a problem that exists in the system today and something we likely need a solution for. I don't know the extent of the problem, because on our installations a node that's down for some time usually comes back wiped clean. However, in order to solve this problem we need some disambiguating marker on replicas to distinguish deleted replicas from those currently in use. A checksum would work if it was persisted on the Namenode, the Namenode would only accept replicas whose checksum matched what it had on record. A timestamp would work as well and wouldn't need to be persisted, the Namenode would treat the most recent replicas as valid.

I'm suggesting that a revision number on a block that takes the form of a timestamp can resolve this issue as well be used to support appends.

> How would it help in the situation I described above, where only a single replica is updated [ ... ]

We need the right protocol here. If we have a 2-phase protocol where a writer first updates the revision number on all replicas before it starts writing then this issue doesn't arise. If one replica updates it's revision then dies, and the Namenode restarts, then the Namenode wouldn't see the replica on the dead node and accept the remaining 2 replicas as valid (assuming a replication of 3). In the event of a replica failing, the client would re-apply a revision change before writing, now using a different timestamp. The two remaining replicas would get updated. If the dead datanode then came back it's copy would be rejected. There are lots of corner cases here, but I believe they can be resolved such that we never silently lose data.


Doug Cutting added a comment - 04/Sep/07 08:31 PM
Sameer, can you please succinctly state what you think the relative advantages and disadvantages of using new block ids versus blockids plus non-persistent version timestamps? I've stated what I think these are above, namely that new block ids use slightly less memory, increase the number of persisted events and simplify semantics and recovery. The last item is I believe the most significant.

Sameer Paranjpye added a comment - 04/Sep/07 09:44 PM
Advantages of revisions:
  • Unified mechanism for supporting appends and detection of stale replicas
  • No need for persistence of revisions

Disadvantages of revisions:

  • Slightly more memory on the Namenode

I don't think that using new block ids simplifies the semantics and recovery. I should qualify that, new block ids don't simplify implementation and recovery unless the old blocks are copied on append/truncate. If hard links are used the semantics and recovery scenarios are equally complex since deltas will show up in both copies. There is a not insignificant performance downside if blocks are copied on append/truncate.


Doug Cutting added a comment - 04/Sep/07 11:00 PM
> Unified mechanism for supporting appends and detection of stale replicas

I don't yet see how that's simplified by revisions. Doesn't that come for free with new block ids? The current rules don't need to change. A datanode doesn't report to the namenode that it has a block until it has a complete copy (i.e., the append is done). A file isn't closed until all blocks are sufficiently replicated (i.e., minimum replication of modified blocks is complete). If no replicas of a block are available, then the file is unreadable. There are no "stale" replicas to detect.


Sameer Paranjpye added a comment - 04/Sep/07 11:33 PM
> Doesn't that come for free with new block ids?

Yes, it does. But it doesn't help with detection of stale replicas from deleted files. I'm saying that revisions can be used to support append and solve the problem of detecting replicas from deleted files.


eric baldeschwieler added a comment - 05/Sep/07 05:38 AM
Just wanted to pitch in some context...

Jim stated in the opening of this bug that a single client writing would be enough to address this issue. I agree. But what we should be clearer about is the ultimate desired semantics for readers. I'd define success as having a single client doing appends and flushes as desired (say per line in a log file) and having multiple clients "tail -f" the file and see updates at a reasonable rate, IE soon after each flush or every 64k bytes or so with less than a seconds latency.

This would let us build systems that log directly into HDFS and have related systems respond based on those log streams.

This is where I'd like to see us get with this issue. Clearly getting there involves getting a handle on all the stuff already discussed in this thread. We also need to think carefully about the pipelining and protocol issues involved in making this work.

We might want to break the protocol change issues into another discussion, but I want to make sure we don't converge on solutions that will not work considering fine grained "flushes".


Doug Cutting added a comment - 05/Sep/07 05:25 PM
> sameer: solve the problem of detecting replicas from deleted files

Is that actually a problem? We'll only generate such garbage blocks when nodes fail, and only a few then, so it doesn't happen at that high of a rate, and the current blockreport mechanism handles that. Block versioning won't let us get rid of the blockreport mechanism, so I still don't see that as an advantage of block versioning. Am I missing something?

> eric: see updates at a reasonable rate, IE soon after each flush or every 64k bytes or so with less than a seconds latency

That's a new requirement that I'd not heard before. The mechanism I'd proposed would not make appends visible until the file is closed, which is not what Eric's asking for. So this may make the case for block versioning. If we really must support what Eric's asking, then non-persistent timestamps on the blocks may be required.


stack added a comment - 05/Sep/07 06:58 PM

Doug: The mechanism I'd proposed would not make appends visible until the file is closed, which is not what Eric's asking for...

To be clear, for the hbase scenario, appends need to be available to a reading process even if the file has not been closed (Otherwise updates can be lost).


Sameer Paranjpye added a comment - 05/Sep/07 07:31 PM
> so it doesn't happen at that high of a rate, and the current blockreport mechanism handles that

It doesn't, a blockreport only report block ids to the Namenode. If a Datanode goes down with garbage blocks, and some of those blockids are assigned to new files, and the Datanode comes back, the garbage blocks will be accepted as valid and cause filesystem corruption. We should try and prevent collisions that can cause corruption.


Jim Kellerman added a comment - 05/Sep/07 07:36 PM
A more complete statement of the HBase use case is:

One process is appending to a file. There is no requirement that the file be available to readers unless

  • the file is closed (what Hadoop has now)
  • the process writing to the file dies. (not currently available)

For the latter case, we would be satisfied if a lease timeout on the file caused it to be closed.

What happens (in HBase) if the process writing the file dies is that the HBase master server gets notified, and it then needs to open and read the file (it knows where the dead process put it).


Doug Cutting added a comment - 05/Sep/07 07:56 PM
> If a Datanode goes down with garbage blocks, and some of those blockids are assigned to new files [ ... ]

How is this related to append? Block id collisions are improbable events that perhaps we should concern ourselves with more. It seems you're arguing that we should minimize block id allocation in order to address this, but that seems like a tiny bandaid and mostly irrelevant to the present discussion.

The proposal I'd made was that new block ids created by appends should only be reported to the namenode when they're complete, and the namenode should not commit a file's new block id list until all block ids are reported complete, making appends atomic. Such an approach would inhibit rapid visibility of appends, which may be required, and that is a vote against it. So we may have to go with something like non-persistent block timestamps, but I still don't see your point about non-persistent timestamps substantially reducing corruption. Rather they seem to create new opportunities for corruption that may require new mechanisms to prevent such corruptions, and that may just be the price we have to pay.


Sameer Paranjpye added a comment - 05/Sep/07 08:06 PM
> How is this related to append?

The mechanism used to detect these collisions i.e. non-persistent timestamps can be trivially leveraged to support appends.

> So we may have to go with something like non-persistent block timestamps, but I still don't see your point about non-persistent timestamps substantially reducing corruption.

They enable detection of stale replicas thereby eliminating that source of corruption.

> Rather they seem to create new opportunities for corruption

I don't see that. Could you explain?


Doug Cutting added a comment - 05/Sep/07 08:32 PM
Ah. I think you're intending that datanodes do persist the block timestamp, but the namenode does not? If so, I'd missed that. I thought you'd said that timestamps were not persisted. But if datanodes persist them then that could indeed help detect block id collisions, since a timestamp collision would nearly impossible (assuming clocks are reasonably synchronized and accurate). So would DFSClient send the timestamp with each flushed buffer, and the datanode log it to a log that's replayed on startup?

As for new opportunities for corruption, I simply meant that having multiple versions of a block increases the chances of getting the wrong version. The namenode and datanode will have substantial new logic to handle block versioning, and more logic increases the chances of faulty logic, introduces new failure modes, etc. The proposal I made required far fewer fundamental changes to block semantics, and thus mostly builds on already debugged logic. Changing blocks from immutable to mutable will require us to uncover all the places where we've assumed immutability. Some of these may not be obvious. That's all I meant. Just merrily spreading a little FUD about change of any sort!


eric baldeschwieler added a comment - 06/Sep/07 08:29 AM
Why would the name node not keep the timestamp/version? Seems to me it will need it to disambiguate new block reports and such. We can afford the bytes IMO.

Those statements about "reasonably synchronized clocks" worry me. Especially if clients or datanodes are producing them. If the name node is not the source of truth, I'm afraid we can introduce weird cases.

I think the next logical step here is for us to outline a more detail proposal for review. I think this is going to take a couple of distinct steps. We can outline those and get to agreement on the basics.

Jim's request that a file exist if its producer dies before a close sounds like it would be easy, but then we get to flush semantics... I take it you don't just want the last complete 128M block Jim? If you want the last "log line" then you need something similar to what I outlined.


Sameer Paranjpye added a comment - 06/Sep/07 05:05 PM
> I think you're intending that datanodes do persist the block timestamp

Yes, that is what I was intending. Sorry, if that wasn't clearer in my description. Timestamps would be persisted on the Datanodes but not on the Namenode. Having the DFSClient generate the timestamps would open us up to some issues with unsynchronized clocks, the probability of error is very low though. Having the Namenode generate the timestamps would be more robust, but it might mean more Namenode transactions depending on how frequently a timestamp is requested. It still exposes us to problems if the clock on the Namenode machine is turned back for some reason. Most robust would be the Namenode generating monotonically increasing revisions.

The DFSClient could send the timestamp with each flushed buffer, but it's not clear that we need the timestamps to be so fine grained. A new timestamp could be generated every time a block is accessed for modification (and when an error occurs during an append). This would require no change in the number of Namenode transactions, but the corner cases when a client dies in the middle of a write could be trickier to handle.

> As for new opportunities for corruption, I simply meant that having multiple versions of a block increases the chances of getting the wrong version

Fair point. Block revisions will mean new code that needs extensive testing and debugging and will likely expose hidden assumptions in the code. We should get a lot of eyes on code that we write for this issue. Given the number of watchers on this issue, it shouldn't be a problem


Doug Cutting added a comment - 06/Sep/07 05:52 PM
> sameer: [ ... ] but the corner cases when a client dies in the middle of a write could be trickier to handle.

Yes, that worries me. I'd be more comfortable sending a new version per flush, so that we have some atomic ground to stand on. Perhaps we can address clock skew by, as you suggest, using revisions rather than timestamps. The namenode need not be involved in the allocation of each revision, only in initializing the revision counter for a modification sequence. After that, the DFSClient can increment the revision for each buffer. Do we lose anything by not using timestamps? We could avoid collisions of blockid+revision with the filesystem id, no?

> eric: the next logical step here is for us to outline a more detail proposal

Yep. That'd be good. +1


Doug Cutting added a comment - 07/Sep/07 03:37 AM
It just occurred to me that we already have a number that monotonically increases as we append to a block: its length. So long as we don't intend to support file modifications other than appends, this could work as the revision. It would have the advantage that, if a node has a stale copy of a block, it could incrementally update the block. Truncation might be tricky with this approach, but perhaps the current version of the final block could be, rather than the longest version, the version that matches the length of the file. Just a thought.

Jim Kellerman added a comment - 07/Sep/07 05:49 PM
eric baldeschwieler - 06/Sep/07 01:29 AM
> Jim's request that a file exist if its producer dies before a close sounds like it would be easy, but then we
> get to flush semantics... I take it you don't just want the last complete 128M block Jim? If you want the
> last "log line" then you need something similar to what I outlined.

Yes, we want to get back the last record we wrote. (unless people think a lossy database is ok

Typically we write multiple records at a time. It would not be a big deal for us to follow a group of writes with a flush, since if we can't write the whole group, we'd rather not get a partial group back.

So this means that we don't need every write to be atomic, just every flush. (Think database transaction with multiple updates: either they all succeed or they all fail)


eric baldeschwieler added a comment - 08/Sep/07 06:44 AM
Yes, I'd thought of using length too. We have had requests to support truncate as well. I'm on the fence on that one. It is simple and more provably correct if we don't do truncates and just track block length.

But we've a significant client group that has had an interest in truncates. We'll do some chewing on that one and see if we want to propose one project that does truncates and appends (less disruption to get both features) or appends only and presumably defer our ambition to do truncates for at least a year.

The combination of very high block replication counts and network partitions still feels like it might produce some kind of nasty cases here.


dhruba borthakur added a comment - 25/Sep/07 07:04 PM
Here is a slightly more detailed description of a proposal to support appending writes to files.
There is an DataGenerationStamp associated with every block. It is persisted by the namenode and by the datanode(s).

*The Writer*
1. The client requests the namenode for a new block. The namenode generates a new blockid and associates a DataGenerationStamp 
    of 0 with this block. It persists this blockId in the inode and the DataGenerationStamp in the BlocksMap. The Namenode 
    returns the blockid, DataGenerationStamp and block locations to the Client.
2. The Client sends the blockid and DataGenerationStamp to all the datanodes in the pipeline. The Datanodes record the blockid and the 
    DataGenerationStamp persistently and returns. In case of error, go to Step 3a.
3. The Client then starts streaming data to the Datanodes in the pipeline.
    The Client notices if any datanodes in the pipeline encountered an error. In this case:
    3a. The Client removes the bad datanode from the pipeline.
    3b. The Client requests a new DataGenerationStamp for this block from the NameNode. The Client also informs the Namenode 
          of the bad Datanode.
    3c. The Namenode removes the bad datanode as a valid block location for this block. The Namenode increments the current 
          DataGenerationStamp by one, persists it, and returns it to the Client.
    3d. The Client sends the new DataGenerationStamp to all remaining datanodes in the pipeline.
    3e. The Datanodes receive the new DataGenerationStamp and persist it.
    3f. The Client can now continue, go back to Step 3 above.
4. The Datanode sends block confirmations to the namenode when the full block is received. The block confirmation has the 
    blockid and DataGenerationStamp in it.
5. The Namenode receives a block confirmation from a Datanode. If the DataGenerationStamp does not match with what is stored in 
    the inode, the namenode refuses to consider that Datanode as a valid replica location. The namenode sends a block delete 
    command to that Datanode.

*Reader (concurrent reading while file is being appended to)*
1.  A reader that opens a file gets the list of blocks from the Namenode. Each block has the block locations 
    and DataGenerationStamp too.
2.  A client sends the DataGenerationStamp along with every read request to a datanode. The datanode refuses the serve the 
     data if the DataGenerationStamp does not match with the value in its persistent store. In this case, the client will fail 
     over to other datanodes.

This algorithm came out of a discussion with Sameer. This solution does not solve the problem of duplicate blockids 
that can result when datanodes that were  down for a long time re-appear.


Doug Cutting added a comment - 25/Sep/07 07:25 PM
When is the block data transmitted? Does a datanode not persist its stamp for a block until after it has persisted the data corresponding to that stamp? That way, if it crashes before it recieves all of the data, it has an out-of-date stamp. Is that right?

What happens when a block is partially modified and the client crashes?


dhruba borthakur added a comment - 26/Sep/07 07:11 PM
Block data is transmitted in Step 3 above. The datanode(s) persist the DataGenerationStamp as soon as it receives it. Then it starts receiving the data. If it crashes before it received all the data, the Client detects this condition and increments the DataGenerationStamp on the namenode and on the remaining good datanodes.

Let's consider the case when the client crashes before the current write (or flush) is successfully transmitted to all datanodes. In this case it is possible that the Datanodes have different sizes of this block. In this case, the lease expires on the Namenode and it is the namenode's duty to do the recovery that the client would have otherwise done. The namenode fetches from the Datanode the size of each replica of the block-under-modification and selects the largest size block as valid. It increments the DataGenerationStamp and sends this new stamp to the datanodes that have the largest size replica.

What happens if the namenode crashes before it could complete the entire lease-timeout-triggered-recovery (described above) for blocks that were being modified? I am thinking about this one.


Doug Cutting added a comment - 26/Sep/07 07:39 PM
This sounds hairy to me, with lots of edge conditions. And we have not yet specified how truncate will work. If the point of using a DataGenerationStamp instead of simply block length is to support truncates too, then we should design and implement truncate too, else how do we know that the design really is sufficient for truncate?

At this point I'm leaning towards only supporting append and using block length. The semantics are much simpler. Eric?


Raghu Angadi added a comment - 26/Sep/07 10:08 PM
Is there a description of what happens when a client wants to append data to a file?

eric baldeschwieler added a comment - 27/Sep/07 12:01 AM
I think we can get away without supporting truncate in the medium term. Less is more...
That said, I don't have a strong opinion on generation numbers vs length.

I'd like to be sure that we are able to detect the stale block issue. Maybe a unique file ID?

Does this proposal try to finish with fewer datanodes than requested or does it request new ones?


Sameer Paranjpye added a comment - 27/Sep/07 03:50 PM
Generation numbers would be needed even if we don't support truncate. Using length is not reliable, there are a few corner cases where data can be silently lost. In addition, the proposal described above isn't completely consistent with our discussions. We'll need to publish something a lot more detailed with explanations of how corner cases are handled.

dhruba borthakur added a comment - 27/Sep/07 05:06 PM
The above proposal says that if a datanode in the pipeline dies, the client detects this condition but does not request a new datanode as a replacement. If this does not give us an acceptable level of reliability,we might have to design something better.

dhruba borthakur added a comment - 19/Oct/07 09:52 AM
Here is a more detailed description of the proposal to support appends in HDFS. This is just the first draft and I will be expanding more on it next week. Please review and let me know your feedback.

dhruba borthakur added a comment - 19/Oct/07 09:55 AM
A Word document that has the same contents as the html design document

Doug Cutting added a comment - 19/Oct/07 04:34 PM
That html was pretty unreadable, at least in firefox. I re-generated from the .doc using Open Office.

Doug Cutting added a comment - 19/Oct/07 04:53 PM
"The Client notices if any Datanodes in the pipeline encountered an error [... ]".

We need a bit more discussion of how this will work. The typical error will not be a nice message sent back up the pipeline, but rather will be a timeout, breaking the pipeline. We must be careful so that timeouts do not cascade, perhaps by decreasing the timeout for each stage added to the pipeline. Removing a node from the pipeline then means building a new pipeline from at least the node before the fracture onwards. Is this the sort of mechanism you have in mind?


dhruba borthakur added a comment - 15/Nov/07 07:56 AM
This version has description about how appends will work with BlockRebalancer and Datanode Periodic verification. It also has details on how a client-flush call will handle data that is smaller than a crc chunk.

Jim Kellerman added a comment - 16/Nov/07 05:08 AM
Comments on the document.

In the Requirements section:

> There can be many simultaneous writers reading the same file at the same time.

Do you mean readers here instead of writers?

As far as I understand the internals of the DFS, this appears to satisfy our requirements.
I think the section entitled "File Deletions and Renames" addresses our needs, but let me be sure by asking the following:

A process is writing to a file and has done flushes. Another process detects that the writer is now dead and reads the file (up to the point it was last flushed) and then deletes the file (because it knows the writer is dead). This happens well before the file lease timeout. My understanding is that the file will be deleted and that if another process tries to create a file with the same name, this should succeed and not contain any of the contents of the original file. Is this correct?

If so, ++1


Jim Kellerman added a comment - 16/Nov/07 05:14 AM
One more comment. As I am not entirely familiar with the internals of DFS, I'd like to see a +1 from Doug (Cutting) and/or Konstantin.

dhruba borthakur added a comment - 16/Nov/07 06:39 AM
Thanks for you comments. Your understanding is absolutely correct. I made the user-case explicit in the section titled "File Deletions and Renames". Thanks.

dhruba borthakur added a comment - 16/Nov/07 06:40 AM
An html version of this design document.

Doug Cutting added a comment - 19/Nov/07 07:12 PM
I think this design looks well-enough thought through that we should probably start trying to implement it.

This design document is great to have! Can you please convert it to forrest XML and add it to a reference section of the documentation tree along with the patch for this issue?


dhruba borthakur added a comment - 29/Nov/07 05:53 PM
Incorporated review comments from Rob Chansler. Thanks Rob.

Ruyue Ma added a comment - 29/Nov/07 06:57 PM
Hi, everyone

We are making our best effort to improve Hadoop Distributed
Filesystem(HDFS). Now, our plan
is implementing single client appending and truncating based on hadoop
0.15.0 version.

Our implementation will follow the desigh of Hadoop-1700. In the
latest version of Hadoop-1700,
we found some points (or bugs) to be corrected/improved. Here are the points:

0. We recommend: the BlockID don't adopt random 64bit ID. We give a
global variable NextBlockID.
The NextBlockID is 64bit, initialized to 0, is recorded into
transaction log. When we need a new BlockID,
we give NextBlockID as the new BlockID, and the NextBlockID is
incremented by 1. The new NextBlockID
is recorded into transaction log.

1. In section The Writer of Hadoop-1700, the original text is:

____________________________________________________________________________________________________________________________
..........
The Writer requests the Namenode to create a new file or open an
existing file with an intention of appending to it.
The Namenode generates a new blockId and a new GenerationStamp for
this block. Let's call the GenerationStamp that
is associated with a block as BlockGenerationStamp. A new
BlockGenerationStamp is generated by incrementing the
global GenerationStamp by one and storing the global GenerationStamp
back into the transaction log. It records
the blockId, block locations and the BlockGenerationStamp in the
BlocksMap. The Namenode returns the blockId,
BlockGenerationStamp and block locations to the Client.
.........
_______________________________________________________________________________________________________________________________
our comment:
The Writer requests the Namenode to create a new file or open an
existing file with an intention of appending to it.
The Namenode generates a new blockId and a new GenerationStamp for
this block. Let's call the GenerationStamp that
is associated with a block as BlockGenerationStamp. A new
BlockGenerationStamp is generated by incrementing the
global GenerationStamp by one and storing the global GenerationStamp
back into the transaction log. ____If the block is
a new block, the Namenode returns the blockID, BlockGenerationStamp
and block locations to the client. The namenode don't
records the blockID, BlockLocations, BlockGenerationStamp in
BlocksMap. If the block is an existing block
(the last block is not full), we generate a new BlockGenerationStamp
and return the BlockID, BlockLocations, Old BlockGenerationStamp,
new BlockGenerationStamp to the client, the namenode don't records the
new BlockGenerationStamp._____

When client successfully updates the all datanodes where the replicas
of block is located in, the client tells namenode that
he has successfully updated the datanodes, along with the
information (BlockID, BlockLocations, new BlockGenerationStamp).
Then, Namenode records the blockId, block locations and the new
BlockGenerationStamp in the BlocksMap, and creates the OpenFile
transaction log.
____________________________________________________________________________________________________________________________-

  • notes: our method can tolerate the following failure. If namenode
    records the new BlockGenerationStamp in BlocksMap, when client don't
    update the
    datanodes or client (writer) crashes, namenode will
    start lease recovery. At this moment, replica version may be smaller
    than the version
    recorded in namenode. So, namenode will discard the
    replicas. It is not our expected result. Our method can tolerate this
    failure.

2. In hadoop-1700, OpenFile transaction log records all the blocklist.
If client invokes flush operation frenquently, maybe the overhead of
namenode is very heavy. So we adopt the following method:
we only record the varying block, don't record the blocks which
are not being modified.
If a new block is created, we record the blockID,
BlockGenerationStamp. If the BLockGenerationStamp of a block is
modified, we
only record the new BlockGenerationStamp for the block.
By this method, we reduce the overhead of namenode when namenode
creates the OpenFile transaction log.

We look forward to getting help from Hadoop Community. Your any advice
will be appreciated.


dhruba borthakur added a comment - 30/Nov/07 05:38 AM
Thanks Ruyue for your comments.

0. Regarding your comment about using a sequential block-id, instead of a random block-id: How do we uprade existing clusters? An existing cluster can have huge number of already allocated blocks.

1. Your comment about Point 1 makes sense. i will update the document.

2. Regarding Point 2: a file typically has few blocks, ranging from somewhere between 2 to 10 blocks. I like your proposal to optimize transaction logging, especially when the number of blocks in a file are huge. Is it possible to consider it as an enhancement and implement it after implementing HDFS Apends? Won't the system be simpler if we avoid this optimization at first go and then once HDFS Append is committed, then we can work on this optimization?


Ruyue Ma added a comment - 26/Dec/07 02:49 AM
this patch gives a simple append implementation based on 0.15.1 version. it is completely compatible with the 0.15.1. It adds a new open() function in DFSClient class.

Usage:

1. open for read

InputStream is = (FSDataInputStream)dfsclient.open(path, buffersize, "r");

2. open for append

OutputStream os = (FSDataOutputStream)dfsclient.open(path, buffersize, "a");

3. open for append (if no file, create it)

OutputStream os = (FSDataOutputStream)dfsclient.open(path, buffersize, "a+c");


Ruyue Ma added a comment - 26/Dec/07 02:54 AM
I have given a new HDFS VFS implementation via fuse-j. This VFS is based on our append implementation.
If any one need it , please send mail to me. My mail is ruyue.ma@gmail.com
The vfs implementation works correctly.

Ruyue Ma added a comment - 26/Dec/07 02:57 AM
our append.patch only simple. The goal is trying to keep compatible with the 0.15.1.
  • It dosen't support fine-grained client flush.
  • I use a sequential block-id, instead of a random block-id, so no blockGenerationStamp is added.

Ruyue Ma added a comment - 26/Dec/07 03:01 AM
Our next patch will give complete append implementation according to Hadoop 1700

Ruyue Ma added a comment - 26/Dec/07 03:16 AM
this patch gives a simple append implementation based on 0.15.1 version. it is completely compatible with the 0.15.1. It adds a new open() function in DFSClient class.

Usage:

1. open for read

InputStream is = (FSDataInputStream)dfsclient.open(path, buffersize, "r");

2. open for append

OutputStream os = (FSDataOutputStream)dfsclient.open(path, buffersize, "a");

3. open for append (if no file, create it)

OutputStream os = (FSDataOutputStream)dfsclient.open(path, buffersize, "a+c");
[ Show » ]
Ruyue Ma - 25/Dec/07 06:49 PM this patch gives a simple append implementation based on 0.15.1 version. it is completely compatible with the 0.15.1. It adds a new open() function in DFSClient class. Usage: 1. open for read InputStream is = (FSDataInputStream)dfsclient.open(path, buffersize, "r"); 2. open for append OutputStream os = (FSDataOutputStream)dfsclient.open(path, buffersize, "a"); 3. open for append (if no file, create it) OutputStream os = (FSDataOutputStream)dfsclient.open(path, buffersize, "a+c");


stack added a comment - 26/Dec/07 06:05 PM
Ruyue:

I know little of hdfs internals but here are a few comments on the patch.

Looks like you made your patch with diff going between a java-old and a java-new directory. Can you makie it instead by doing a 'svn diff' inside the top directory of an hadoop checkout? You should also make the patch apply to TRUNK. Currently it does not. Its unlikely that your patch will be added to the 0.15.x branch since usually only bug fixes are made against released versions and as is, it makes it hard for me to take your patch for a test drive.

Tabs should be purged and replaced by spaces to make your patch palatable.

New classes could do with javadoc explaining their purpose: e.g. BasicAppendInfo class comment could explain where its used and the payload it carries.

New methods added to the protocol could also do with javadoc: e.g. baseAppendFile. Also, I'd guess that as is, your patch will emit javadoc warnings which will make it so your patch will fail the patch-build (your createFile method does not detail the method parameters). You also have an instance of empty javadoc.

Your getting of blockid is synchronized but not on setting.


dhruba borthakur added a comment - 04/Jan/08 12:04 AM
Hi Ruyue,

I like all of the preceding review comments by stack. I have two basic questions.

1. If a client has opened a file for "append" and is writing to the end of the file. The last block of the file has three replicas. It can so happen that the system fails to write the data to some replicas. What is your error recovery policy? All replicas would be of different size, can you pl point me to the code that handles this inconsistency? I cannot find this code in your patch.

2. The DataNode, while finalizing a block, sends the block-size to the NameNode. In the case of "appends", the block size of the block changes with every write. Can you pl point me to the code that makesthe necessary changes to the Namenode's data structure that reflects the new-size after the append is finished? I could not find this code in your patch.

thanks,
dhruba


Chad Walters added a comment - 18/Jan/08 04:43 AM
What's the status on this patch? Is there anything that can be done to move this along? This is long-awaited functionality on many fronts. There was also some mention by Ruyue of a more complete patch coming soon. Would be willing to help out in any way possible.

dhruba borthakur added a comment - 18/Jan/08 06:03 PM
Hello Chad, I am currently working on HADOOP-2345. I created additional JIRAs (HADOOP-2655, HADOOP-2656) that are pre-requisites for this one. Would appreciate any help in addressing these JIRAs.

dhruba borthakur added a comment - 03/Feb/08 09:31 AM
I am canceling this patch because I am not convinced that the current patch does error recovery.

stack added a comment - 04/Mar/08 07:46 PM
Was wondering if any updated status on this patch. We're having an HBase User Group meeting this evening and I'm sure there'll be an inquiry as to state of this issue. Thanks.

dhruba borthakur added a comment - 04/Mar/08 08:50 PM
Most of the JIRA linked to this one are fixed or have patches associated with them. That means mos tof the pre-resuisites for supporting appends are already done. i am currently working on a distributed upgrade to upgrade a current cluster to the new disk format (HADOOP-2656).

Once the above pre-requisites are done, the new code for supporting Appends will be started. I am guessing it will be another 1.5 man month before it appears as a draft patch.


Doug Cutting added a comment - 14/Apr/08 06:52 PM
Should we add another blocking issue to extend the FileSystem API to support append? This might just be adding a new method, 'FSDataOutputStream append(Path)' that throws UnsupportedOperationException by default. We could implement it in the local filesystem then too.

This might be useful to applications (like HADOOP-3245) that might want to start writing append-aware code before HDFS append is available. Thoughts?


dhruba borthakur added a comment - 14/Apr/08 07:53 PM
Creating a new JIRA to decide on the Append API sounds great. I will create a JIRA and link it with H-1700.

Doug Cutting added a comment - 23/Apr/08 08:38 PM
HDFS's cache of FileStatus in each Path (HADOOP-2565) should be removed before append is implemented.

Sameer Paranjpye added a comment - 07/May/08 03:55 PM
Dhruba, need some guidance here. Do you intend to complete work on append? Are there any append related patches ready for review or commit that you need help with?

Robert Chansler added a comment - 07/May/08 10:53 PM
Attached a record of some of the discussions of how to begin using sequential block IDs.
Grid_HadoopRenumberBlocks.pdf

dhruba borthakur added a comment - 12/May/08 04:33 AM
Hi Sameer, I would like to complete the "append" work as described in the design paper associated with this JIRA. To achieve this, I would like to get HADOOP-2656 committed. This JIRA adds a generation stamp to the Block object and is a blocker for all succeeding append-related JIRAs.

Then, Nicholas's HADOOP-3310 should be committed followed by HADOOP-3113 that I am currently working on.


Tsz Wo (Nicholas), SZE added a comment - 13/May/08 12:17 AM
> To achieve this, I would like to get HADOOP-2656 committed.

I think HADOOP-2656 is ready to be committed.


Sameer Paranjpye added a comment - 13/May/08 06:20 PM
> Hi Sameer, I would like to complete the "append" work as described in the design paper associated with this JIRA

Awesome! Thanks Dhruba. Nicholas will work with you to take this forward.


Robert Chansler added a comment - 13/May/08 10:59 PM
The current generation stamp implementation should be committed. I recorded the discussion of serial block IDs for the curious, but I don't propose changing directions. HADOOP-2656 should go!

Tsz Wo (Nicholas), SZE added a comment - 06/Jun/08 10:10 PM - edited
Append is almost ready.

1700_20080606.patch: implemented DistributedFileSystem.append(...).

TODO:

  • trigger lease recovery when a lease already exists and is expired
  • initialize write to the end of a file

dhruba borthakur added a comment - 08/Jun/08 11:20 PM
The patch looks good. This patch introduces the changes to the ClientProtocol. More changes will be required for the data transfer protocol between the client and datanode to implement the full functionality. I documented one of the major remaining issues in HADOOP-3515.

Till we develop a patch for HADOOP-3515, we either hold off on this patch or check this patch in by making appendFileInternal throw an unsupported exception. If vote for creating a new JIRA to check in this patch but leave HADOOP-1700 open till we get a complete solution for appends.


Tsz Wo (Nicholas), SZE added a comment - 09/Jun/08 09:37 PM
So HADOOP-3515 takes care "initialize write to the end of a file".

For "trigger lease recovery when a lease already exists and is expired", I think NameNode could throw an exception for Lease expired soft limit, then the client can go ahead to trigger the lease recovery.

The patch also needs some tests.


dhruba borthakur added a comment - 29/Jun/08 10:49 AM
Finally, the last patch for supporting "append" on HDFS. This patch supports a single simultaneous appender to a file.

I purposely left in a few debug prints for now. This version of the patch is now available for review.


dhruba borthakur added a comment - 07/Jul/08 10:58 PM
This is the final patch for supporting appends. It has an unit test that randomly triggers random-size appends to file of random size with random replication factor.

Tsz Wo (Nicholas), SZE added a comment - 08/Jul/08 08:34 PM
appendtrunk6.patch does not apply anymore. Could you update it, Dhruba?

dhruba borthakur added a comment - 08/Jul/08 10:36 PM
Merged patch with latest trunk.

Tsz Wo (Nicholas), SZE added a comment - 10/Jul/08 02:43 AM
Not able to review the whole patch. Below are something I found so far. Updated to appendtrunk8.patch.

Major changes:

  • Checking permission for append in FSNamesystem
  • Surround reading steam codes by a try-finally in computePartialChunkCrc(...)
  • Making BlockInputStreams implements Closeable
  • Re-arranged the DFSOutputStream constructors for reducing code duplication
  • Re-arranged some checksum codes for reducing code duplication

Minor changes:

  • throwing IOException("Not supported") in SimulatedFSDataset.getTmpInputStreams(...)
  • Renamed BlockReadStreams to BlockInputStreams.
  • added StringUtils.byteToHexString(byte[] bytes, int start, int end)
  • added java doc in DFSClient.append(...)
  • Checking LOG.isDebugEnabled() before calling LOG.debug(...)

Tsz Wo (Nicholas), SZE added a comment - 14/Jul/08 09:09 PM
appendtrunk9.patch: added two more tests for testing appending to an non-existing file and permission checking

However, TestFileAppend2 currently is failed, even for trunk + appendtrunk7.patch. Could you check it, Dhruba?


dhruba borthakur added a comment - 14/Jul/08 09:38 PM
Thanks for the tests and associated code-changes Nicholas. I will look into why the unit test failed on your setup.

Tsz Wo (Nicholas), SZE added a comment - 15/Jul/08 12:56 AM
appendtrunk10.patch: fixed a bug. Passed TestFileAppend2 on Linux but there is a rename problem in Windows.

Tsz Wo (Nicholas), SZE added a comment - 15/Jul/08 06:21 PM
appendtrunk11.patch: fixed the rename problem on Windows. The only test it fails is TestDatanodeDeath. It also fails on trunk + appendtrunk7.patch

Tsz Wo (Nicholas), SZE added a comment - 15/Jul/08 07:03 PM
appendtrunk12.patch: removed some debug codes

dhruba borthakur added a comment - 16/Jul/08 01:39 AM
This fixes the test failure with TestDatanodeDeath.

dhruba borthakur added a comment - 16/Jul/08 08:15 PM
Added TestFileAppend2 unit test.

dhruba borthakur added a comment - 16/Jul/08 09:44 PM
Fixed javac and javadoc warnings. This probably has two Findbugs warnings still to be fixed.

dhruba borthakur added a comment - 16/Jul/08 10:04 PM
Incorporated most review comments. Let's try HadoopQA tests.

dhruba borthakur added a comment - 17/Jul/08 12:40 AM
Fixed findbugs warnings. Many thanks to Nicholas for helping me run findbugs.

Hadoop QA added a comment - 17/Jul/08 02:27 PM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12386249/appendtrunk14.patch
against trunk revision 677470.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 9 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2888/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2888/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2888/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2888/console

This message is automatically generated.


dhruba borthakur added a comment - 17/Jul/08 04:34 PM
Ok, got +1 from hadoopQA. I am planning to check this into trunk by the end of this week!

Konstantin Shvachko added a comment - 17/Jul/08 05:48 PM
This is a large patch, and we don't have a test plan or any evidence that the feature was thoroughly tested.
The unit test includes two unit tests
  1. Create one file, close, reopen and append.
  2. Run many appends in parallel to many files.

What really need to be tested is different failure scenarios, like when the client or one of data-nodes in the pipeline fail
at different stages of the transaction, and how name-node reacts to this. Or what happens if name-node fails in the middle
of appends and restarted, etc. As we have seen with the lease recovery feature all these things can cause serious problems
and should be tested in details.
So as stated in HADOOP-2658 a test plan should be both designed and implemented before such a big feature can be committed.
Imo it should include a lot more unit tests.

Briefly looked at the code, noticed that you forgot to increment the ClientDatanodeProtocol.versionID although the comment is there.


Robert Chansler added a comment - 17/Jul/08 05:55 PM
This is a substantial new feature, and 0.18 is still waiting on the fixes to previous changes needed to facilitate append.

Konstantin's previous comment has some useful detail about necessary unit tests. Also there is a necessity for at-scale functional testing.

Has there been any functional testing from application programs? At challenging load levels? In competition with other activity?

Has performance testing established that append changes do not have an adverse impact on other operations?

Does the design documentation need to be updated since November?

What about user documentation?


Doug Cutting added a comment - 17/Jul/08 06:08 PM
As a new feature, the testing requirements are somewhat lower, no? Existing tests should test existing functionality, so we should have some confidence that this won't break existing applications. If new applications that use the new feature are broken then that's less of a problem. So we ought to perhaps do our normal stress-testing (e.g., sort900) before committing this, to ensure that existing features still work well.

Things are actually a bit more complex, since, e.g., a new application that uses this new feature might crash a namenode that's shared with an existing application. One way to address this concern might be to permit the new feature (appends) to be disabled.

That said, I think more thorough tests would be good to have, I just question whether they need to hold up committing this. Getting features into trunk earlier, well before a freeze, is also a good way to find problems, problems that might not be found with synthetic tests.

So, my concrete proposal: Before we commit this we should,

  • make sure that large-scale tests still run well
  • add a flag to disable appends, in case they introduce namenode or datanode instabilities
  • add a 0.19 blocker that adds more append tests
  • add a 0.19 blocker that adds better append documentation

Does that sound reasonable?


Nigel Daley added a comment - 18/Jul/08 06:46 AM
Dhruba, wasn't there a test plan for this at one point? I can't seem to find it.

Doug's comments seem reasonable if someone is going to sign up to fix those blockers.

In addition to Doug's list, I'd add that a test plan must be published for this feature.


dhruba borthakur added a comment - 18/Jul/08 05:13 PM
I agree with Doug. Also, this patch is pretty light-weight compared to earlier append-related patches. Most of the heavy-lifting has been done in earlier patches that are part of 0.18. This patch does not change any disk format(s).

Making a configurable would have been an option, I will do it as part of this JIRA.

Regarding test plan, there is a JIRA opened in Jan 08: http://issues.apache.org/jira/browse/HADOOP-2658. Also, large-scale functional and performance testing would have to be done at a site that has plenty machines. I would request Nigel/Mukund to take a initial-stab at these two items.

Opened new JIRA to add more append-related tests. http://issues.apache.org/jira/browse/HADOOP-3790. I have assigned this to myself.


dhruba borthakur added a comment - 18/Jul/08 05:14 PM
Introduce a configuration option to support/disable hdfs appends.

Robert Chansler added a comment - 19/Jul/08 01:06 AM
The answer to Doug's question is, "Not necessarily." But, I'm in general agreement with his proposal with the understanding that the new blockers are assigned in advance.

dhruba borthakur added a comment - 19/Jul/08 01:25 AM
If dfs.support.append is set to false, then the "append" feature to hdfs will be switched off.

Hadoop QA added a comment - 19/Jul/08 11:41 AM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12386453/appendtrunk15.patch
against trunk revision 678080.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 9 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2913/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2913/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2913/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2913/console

This message is automatically generated.


dhruba borthakur added a comment - 20/Jul/08 05:21 AM
Hi Nigel/Mukund,

I understand that large-scale performance testing and a test-plan are two things that are keeping this issue from being checked in. Would appreciate it a lot if any of you can help in this regard. thanks in advance,

dhruba


Konstantin Shvachko added a comment - 21/Jul/08 06:30 PM
As far as I understood the latest discussion the prerequisites for this patch is the Doug's proposal plus the test plan.
I'll repeat them here.
  • make sure that large-scale tests still run well
  • add a flag to disable appends, in case they introduce namenode or datanode instabilities (done)
  • add a 0.19 blocker that adds more append tests
  • add a 0.19 blocker that adds better append documentation
  • Develop and implement test plan (HADOOP-2658)

The biggest issues are the large scale test and the test plan.
I think if we implement the test plan the large scale test would go much smoother.
How large is large enough? A 100 would do it imo.


Doug Cutting added a comment - 21/Jul/08 08:17 PM
> Develop and implement test plan (HADOOP-2658)

I thought we could commit 1700 before we implement all the tests, provided the tests were a 0.19 blocker and folks were committed to implement them ASAP.


Sameer Paranjpye added a comment - 22/Jul/08 08:26 PM
+1

Dhruba, if you can author a test plan in HADOOP-2658, we can work with you to implement and run the tests at scale.


dhruba borthakur added a comment - 22/Jul/08 08:32 PM
Ok, i will author a test plan. Will post the test plan to HADOOP-2658.

dhruba borthakur added a comment - 23/Jul/08 01:20 PM
I had an offline discussion with Nicholas, Sameer, Rob Chansler and Nigel. We agreed that we will open a couple od JIRAs (marked as blockers for 0.19), these will be related to documentation, writing more unit tests, creating a append-benchmark and running large-scale tests to validate existing behaviour and performance. Rob has volunteered to create these JIRAs. Once these are created, we should check in this patch.

This means that it is likely that this patch gets checked into trunk in a day or two.


Tsz Wo (Nicholas), SZE added a comment - 23/Jul/08 06:43 PM
+1 appendtrunk15.patch looks good

Konstantin Shvachko added a comment - 23/Jul/08 08:29 PM
ClientDatanodeProtocol version is still 1. Should be 2.

dhruba borthakur added a comment - 24/Jul/08 01:24 PM
Incorporated Konstantin's review comments.

dhruba borthakur added a comment - 24/Jul/08 01:24 PM
Rerun HadoopQA one final time!

dhruba borthakur added a comment - 24/Jul/08 01:28 PM
Rerun HadoopQA one final time!

Hadoop QA added a comment - 25/Jul/08 03:00 AM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12386795/appendtrunk16.patch
against trunk revision 679601.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 9 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2944/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2944/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2944/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2944/console

This message is automatically generated.


dhruba borthakur added a comment - 25/Jul/08 06:07 PM
Appends to HDFS is finally committed! Many thanks to many many people who contributed to the design, development and testing. Special thanks to Nicholas who developed some key pieces related to this patch!

dhruba borthakur added a comment - 25/Jul/08 06:09 PM
I just committed this!

stack added a comment - 25/Jul/08 08:21 PM
Whoopdee!

Hudson added a comment - 22/Aug/08 12:34 PM