Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.15.1
    • Fix Version/s: 0.19.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      Introduced append operation for HDFS files.

      Description

      Requests for the ability to append to files in HDFS have been raised a couple of times on the list of late. For one example, see http://www.nabble.com/HDFS%2C-appending-writes-status-tf3848237.html#a10916193. Other mail describes folks' workarounds for the lack of this feature: e.g. http://www.nabble.com/Loading-data-into-HDFS-tf4200003.html#a12039480 (later on this thread, Jim Kellerman re-raises the HBase need for this feature). HADOOP-337 'DFS files should be appendable' mentions file append, but it was opened early in the life of HDFS when the focus was more on implementing the basics than on adding new features, and interest fizzled. Because HADOOP-337 is also a bit of a grab-bag – it includes truncation and being able to concurrently read/write – rather than try to breathe new life into HADOOP-337, here is a new issue focused on file append. Ultimately, being able to do as the Google GFS paper describes – having multiple concurrent clients make 'Atomic Record Append' to a single file – would be sweet, but at least for a first cut at this feature, IMO, a single client appending to a single HDFS file, letting the application manage the access, would be sufficient.

      1. 1700_20080606.patch
        14 kB
        Tsz Wo Nicholas Sze
      2. append.patch
        22 kB
        Ruyue Ma
      3. append3.patch
        55 kB
        dhruba borthakur
      4. Appends.doc
        76 kB
        dhruba borthakur
      5. Appends.doc
        70 kB
        dhruba borthakur
      6. Appends.html
        45 kB
        dhruba borthakur
      7. appendtrunk10.patch
        81 kB
        Tsz Wo Nicholas Sze
      8. appendtrunk11.patch
        80 kB
        Tsz Wo Nicholas Sze
      9. appendtrunk12.patch
        80 kB
        Tsz Wo Nicholas Sze
      10. appendtrunk13.patch
        79 kB
        dhruba borthakur
      11. appendtrunk13.patch
        79 kB
        dhruba borthakur
      12. appendtrunk13.patch
        64 kB
        dhruba borthakur
      13. appendtrunk14.patch
        79 kB
        dhruba borthakur
      14. appendtrunk14.patch
        80 kB
        dhruba borthakur
      15. appendtrunk15.patch
        80 kB
        dhruba borthakur
      16. appendtrunk16.patch
        80 kB
        dhruba borthakur
      17. appendtrunk6.patch
        73 kB
        dhruba borthakur
      18. appendtrunk7.patch
        73 kB
        dhruba borthakur
      19. appendtrunk8.patch
        76 kB
        Tsz Wo Nicholas Sze
      20. appendtrunk9.patch
        79 kB
        Tsz Wo Nicholas Sze
      21. Grid_HadoopRenumberBlocks.pdf
        74 kB
        Robert Chansler

        Issue Links

          Activity

          Jim Kellerman added a comment -

          For HBase, we only need a single appender.

          Additionally, since we would do a batch of appends at one time, we would be happy if the appends did not become visible until we called flush() or something like that. We certainly do not have a requirement that every append become visible immediately.

          Doug Cutting added a comment -

           I think the easiest way to implement appends (and other file modifications) is to keep blocks immutable, but permit one to alter the block list of a file. The simplest implementation of this would require the client, when appending, to re-read the entire last block of a file and then re-write a longer version of it as a new block. This could perhaps be optimized by adding a datanode command that permitted one to start writing a new block on a datanode whose initial contents are those of another block, then append new data to that block. That datanode could chain the append to the other replicas.
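
           A minimal Java sketch of this copy-on-append idea; BlockClient, NamenodeClient and their methods are invented stand-ins for illustration, not actual HDFS interfaces:

           import java.util.List;

           public class CopyOnAppendSketch {
             // Invented stand-ins for datanode/namenode RPCs; not real HDFS APIs.
             interface BlockClient {
               byte[] readBlock(long blockId);      // read the full contents of a block
               long writeNewBlock(byte[] data);     // write data as a brand-new block, return its id
             }
             interface NamenodeClient {
               List<Long> getBlockIds(String path);                        // file's current block list
               void replaceLastBlock(String path, long oldId, long newId); // swap in the longer block
             }

             // Append by copying: re-read the last (immutable) block, write a longer copy
             // under a new block id, then update the file's block list. Assumes the file
             // already has at least one block.
             static void append(NamenodeClient nn, BlockClient dn, String path, byte[] extra) {
               List<Long> blocks = nn.getBlockIds(path);
               long lastId = blocks.get(blocks.size() - 1);
               byte[] old = dn.readBlock(lastId);
               byte[] merged = new byte[old.length + extra.length];
               System.arraycopy(old, 0, merged, 0, old.length);
               System.arraycopy(extra, 0, merged, old.length, extra.length);
               long newId = dn.writeNewBlock(merged);
               nn.replaceLastBlock(path, lastId, newId);  // the old block becomes garbage to collect
             }
           }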

          Raghu Angadi added a comment -

           I am not sure this is simpler than allowing the last block to be expanded by datanodes. Is there any advantage of making a block immutable (other than implementation)?

          Doug Cutting added a comment -

           > Is there any advantage of making a block immutable [ ... ]

           Consider a datanode that's offline during an append and then comes online later. How do we discover that its replica of a block is out of date? Block identity would have to become <id,length>. And that would only work for append, but not for other block modifications.

          Sameer Paranjpye added a comment -

          > Consider a datanode that's offline during an append and then comes online later [ ... ]

          <id, length> would work for both append and truncate, it wouldn't work for random writes into a block. A block could have a generation number that gets incremented every time the block is accessed for modification. The generation number would be recorded by the Namenode, which could discard old copies of blocks.
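
           As a rough illustration of that bookkeeping (the class and method names below are invented for illustration, not existing HDFS code), the namenode-side logic might look like:

           import java.util.HashMap;
           import java.util.Map;

           public class GenerationNumberSketch {
             record ReportedReplica(long blockId, long generation, String datanode) {}

             // Latest generation number per block, kept in the namenode's block map.
             private final Map<Long, Long> latestGeneration = new HashMap<>();

             // Called whenever a block is opened for modification: bump its generation.
             long openForModification(long blockId) {
               return latestGeneration.merge(blockId, 1L, Long::sum);
             }

             // Called on replica reports: copies carrying an older generation are stale
             // and could be scheduled for deletion.
             boolean isCurrent(ReportedReplica r) {
               long current = latestGeneration.getOrDefault(r.blockId(), 0L);
               return r.generation() >= current;
             }
           }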

          Sameer Paranjpye added a comment -

          Making a block immutable makes appends and truncates very heavyweight. You have to copy on average blocksize*2*num_replicas bytes to make any sort of modification.

          Doug Cutting added a comment -

          > A block could have a generation number that gets incremented every time the block is accessed for modification.

          But isn't <blockid+generation> really tantamount to a new block id? How is this any different from simply hard-linking to the old block file and then modifying it?

          > Making a block immutable makes appends and truncates very heavyweight. You have to copy on average blocksize*2*num_replicas bytes to make any sort of modification.

           The copies could be local on each datanode, not across the wire. They could be made from one drive to another, or, in the case of append, you might hard-link to the old block file if the client is sure never to read past the block end. So append need not be heavyweight at all.

          If you're willing to possibly break existing readers while truncating, then you could always hard link block files, and never perform any copying. The file could be corrupted, e.g., if a truncate succeeds on the datanode but the block list on the namenode is not updated to reflect that, or vice versa, but I think the blockid+revision approach has exactly the same issue. Copying could prevent such issues, making modifications atomic, but at some cost to performance.

          Sameer Paranjpye added a comment -

          > But isn't <blockid+generation> really tantamount to a new block id?

           Not really; the implications for implementation are pretty different. If a new block id is to be used, the Namenode has to allocate a new block and delete the old block. Scheduling the old block's replicas for deletion, dispatching the requests and journaling the new block is a non-trivial amount of Namenode activity. A revision number update can simply be recorded in memory. In the event of a conflict the Namenode would treat the replicas with the highest revision number as valid and discard out-of-date replicas.

          > Copying could prevent such issues [ ... ]

          Copying does make error handling somewhat easier. But it seems to me that it does so only when changes to a file are exposed in the Namenode at a block granularity. If we want to make changes visible at a finer grain both approaches have similar complexity in the corner cases of datanodes and writers crashing in the middle of updates.

          Doug Cutting added a comment -

          > A revision number update can simply be recorded in memory.

           So the namenode wouldn't persist the block version number, it would just keep track of the highest revision that's yet been reported to it? If for any reason all of the replicas are not updated to the latest revision, the namenode would still need to issue replication and deletion commands, right? But I think you're arguing that in what's hopefully the common case, when all replicas are successfully modified, the namenode need only increment its in-memory block revision number and would not need to issue any other commands nor persist any data. That would indeed be a namenode performance advantage to this approach.

          One cost of this approach is that block revision numbers would consume more memory per block on the namenode, something we wish to minimize. So there is something of a time/space tradeoff.

          Some edge semantics would be different too. If only one replica is updated, and its datanode dies, and the namenode is restarted, then the change would silently be lost, no? With new block ids, this would result in a file with a missing block that could not be read.

          Doug Cutting added a comment -

          Thinking a bit more, it seems to me that with the new block-id approach, one could also manage things so that, when there are no failures, the namenode need not schedule deletions. Where in one approach the datanode would report the new block revision number, in the other approach it could report the old block id that was removed from that datanode. So the primary differences I now see between the approaches are just persistence and namenode memory size.

          Sameer Paranjpye added a comment -

          > One cost of this approach is that block revision numbers would consume more memory per block on the namenode, something we wish to minimize. So there is
          > something of a time/space tradeoff.

           We're saving hundreds of bytes with HADOOP-1687; what's the use of doing that if we don't use them?

           But seriously, the space cost is very small: an additional 4-8 bytes/block. In all of the memory-usage reductions proposed in HADOOP-1687, there has been very little culling of file and block metadata. Most of the savings come from the use of better containers (sorted lists vs. treemaps, for instance) and from limiting the total number of Java objects the Namenode has in memory. My point is that the data members of file and block objects have not been the source of our memory problems and are unlikely to be, even if we add a few dozen bytes to them.

          > Thinking a bit more, it seems to me that with the new block-id approach, one could also manage things so that [ ... ]

           That could certainly be done.

           One issue with new block ids, though, is that if a datanode fails in the middle of an update, it would have both the old and new blocks, say <oldid> and <newid>. Replicas where the update succeeded would remove the old block and inform the Namenode, which would remove it from its blocks map. The old block would remain on the failed datanode and be orphaned. If the Namenode then assigned this <oldid> to a new file, there would be a block-id collision. This problem already exists to some degree: if a file is deleted and a datanode containing that file's blocks fails, then these orphans remain on the failed node and a collision can happen when it comes back.

           If we use a revision number that takes the form of a timestamp, it can be used to distinguish not only out-of-date replicas of a currently existing file but also those from old, long-deleted files. Two birds and all that ...

          Doug Cutting added a comment -

          > If the Namenode then assigned this <oldid> to a new file, there would be a block-id collision.

          As you observe, this also can happen if, e.g., a datanode is offline when a file is deleted, etc. So I don't think using new blockids to implement appends really makes this that much more likely.

          The primary advantage you've presented that versioning blocks has over allocating new blockids is that version numbers might not have to be persisted. But versions come at a cost: they take more memory; they have some worrisome edge conditions; and they introduce new concepts into a complex system, rather than simply building on existing, debugged concepts. None of these are fatal, but I don't yet see versioning as the clearly winning strategy.

          > If we use a revision number that takes the form of a timestamp, it can be used to distinguish not only out of date replicas of a currently existing file but also those from old long deleted files.

          I don't follow how this would work. Can you explain more? How would it help in the situation I described above, where only a single replica is updated, its datanode dies, and the namenode is restarted. The file should be corrupt, since the only up-to-date replica of one of its blocks is missing. How would you detect that? Would you declare any file corrupt whose last-modified time was later than that of any of its blocks? That seems fragile.

          Sameer Paranjpye added a comment -

          > As you observe, this also can happen if, e.g., a datanode is offline when a file is deleted, etc. So I don't think using new blockids to implement appends really makes this
          > that much more likely.

           You're right, it doesn't make it much more likely. But it is a problem that exists in the system today and something we likely need a solution for. I don't know the extent of the problem, because on our installations a node that's down for some time usually comes back wiped clean. However, in order to solve this problem we need some disambiguating marker on replicas to distinguish deleted replicas from those currently in use. A checksum would work if it were persisted on the Namenode; the Namenode would only accept replicas whose checksum matched what it had on record. A timestamp would work as well and wouldn't need to be persisted; the Namenode would treat the most recent replicas as valid.

           I'm suggesting that a revision number on a block that takes the form of a timestamp can resolve this issue as well as be used to support appends.

          > How would it help in the situation I described above, where only a single replica is updated [ ... ]

           We need the right protocol here. If we have a 2-phase protocol where a writer first updates the revision number on all replicas before it starts writing, then this issue doesn't arise. If one replica updates its revision and then dies, and the Namenode restarts, then the Namenode wouldn't see the replica on the dead node and would accept the remaining 2 replicas as valid (assuming a replication of 3). In the event of a replica failing, the client would re-apply a revision change before writing, now using a different timestamp. The two remaining replicas would get updated. If the dead datanode then came back, its copy would be rejected. There are lots of corner cases here, but I believe they can be resolved such that we never silently lose data.
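
           A minimal sketch of that 2-phase idea, assuming a hypothetical ReplicaEndpoint interface (these are not actual HDFS client or datanode APIs):

           import java.util.List;

           public class TwoPhaseAppendSketch {
             interface ReplicaEndpoint {
               void setRevision(long blockId, long revision) throws Exception; // phase 1
               void append(long blockId, byte[] data) throws Exception;        // phase 2
             }

             // Phase 1 stamps every replica with the new revision before any bytes are
             // written, so a replica missing the latest revision is recognizably stale.
             // On failure the caller would retry with a fresh revision against the
             // surviving replicas (error handling is simplified away here).
             static void append(List<ReplicaEndpoint> replicas, long blockId,
                                long revision, byte[] data) throws Exception {
               for (ReplicaEndpoint r : replicas) {
                 r.setRevision(blockId, revision);   // phase 1: update revision everywhere
               }
               for (ReplicaEndpoint r : replicas) {
                 r.append(blockId, data);            // phase 2: stream the data
               }
             }
           }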

          Doug Cutting added a comment -

           Sameer, can you please succinctly state what you think the relative advantages and disadvantages are of using new block ids versus blockids plus non-persistent version timestamps? I've stated what I think these are above, namely that new block ids use slightly less memory, increase the number of persisted events, and simplify semantics and recovery. The last item is, I believe, the most significant.

          Sameer Paranjpye added a comment -

          Advantages of revisions:

          • Unified mechanism for supporting appends and detection of stale replicas
          • No need for persistence of revisions

          Disadvantages of revisions:

          • Slightly more memory on the Namenode

           I don't think that using new block ids simplifies the semantics and recovery. I should qualify that: new block ids don't simplify implementation and recovery unless the old blocks are copied on append/truncate. If hard links are used, the semantics and recovery scenarios are equally complex since deltas will show up in both copies. There is a not-insignificant performance downside if blocks are copied on append/truncate.

          Doug Cutting added a comment -

          > Unified mechanism for supporting appends and detection of stale replicas

          I don't yet see how that's simplified by revisions. Doesn't that come for free with new block ids? The current rules don't need to change. A datanode doesn't report to the namenode that it has a block until it has a complete copy (i.e., the append is done). A file isn't closed until all blocks are sufficiently replicated (i.e., minimum replication of modified blocks is complete). If no replicas of a block are available, then the file is unreadable. There are no "stale" replicas to detect.

          Sameer Paranjpye added a comment -

          > Doesn't that come for free with new block ids?

          Yes, it does. But it doesn't help with detection of stale replicas from deleted files. I'm saying that revisions can be used to support append and solve the problem of detecting replicas from deleted files.

          eric baldeschwieler added a comment -

          Just wanted to pitch in some context...

           Jim stated in the opening of this bug that a single client writing would be enough to address this issue. I agree. But what we should be clearer about is the ultimate desired semantics for readers. I'd define success as having a single client doing appends and flushes as desired (say per line in a log file) and having multiple clients "tail -f" the file and see updates at a reasonable rate, i.e. soon after each flush or every 64k bytes or so, with less than a second's latency.

          This would let us build systems that log directly into HDFS and have related systems respond based on those log streams.

          This is where I'd like to see us get with this issue. Clearly getting there involves getting a handle on all the stuff already discussed in this thread. We also need to think carefully about the pipelining and protocol issues involved in making this work.

           We might want to break the protocol change issues into another discussion, but I want to make sure we don't converge on solutions that will not work with fine-grained "flushes".

          Doug Cutting added a comment -

          > sameer: solve the problem of detecting replicas from deleted files

          Is that actually a problem? We'll only generate such garbage blocks when nodes fail, and only a few then, so it doesn't happen at that high of a rate, and the current blockreport mechanism handles that. Block versioning won't let us get rid of the blockreport mechanism, so I still don't see that as an advantage of block versioning. Am I missing something?

           > eric: see updates at a reasonable rate, i.e. soon after each flush or every 64k bytes or so, with less than a second's latency

          That's a new requirement that I'd not heard before. The mechanism I'd proposed would not make appends visible until the file is closed, which is not what Eric's asking for. So this may make the case for block versioning. If we really must support what Eric's asking, then non-persistent timestamps on the blocks may be required.

          stack added a comment -

          Doug: The mechanism I'd proposed would not make appends visible until the file is closed, which is not what Eric's asking for...

          To be clear, for the hbase scenario, appends need to be available to a reading process even if the file has not been closed (Otherwise updates can be lost).

          Sameer Paranjpye added a comment -

          > so it doesn't happen at that high of a rate, and the current blockreport mechanism handles that

           It doesn't; a blockreport only reports block ids to the Namenode. If a Datanode goes down with garbage blocks, and some of those blockids are assigned to new files, and the Datanode comes back, the garbage blocks will be accepted as valid and cause filesystem corruption. We should try to prevent collisions that can cause corruption.

          Jim Kellerman added a comment -

          A more complete statement of the HBase use case is:

          One process is appending to a file. There is no requirement that the file be available to readers unless

          • the file is closed (what Hadoop has now)
          • the process writing to the file dies. (not currently available)

          For the latter case, we would be satisfied if a lease timeout on the file caused it to be closed.

          What happens (in HBase) if the process writing the file dies is that the HBase master server gets notified, and it then needs to open and read the file (it knows where the dead process put it).

          Doug Cutting added a comment -

          > If a Datanode goes down with garbage blocks, and some of those blockids are assigned to new files [ ... ]

          How is this related to append? Block id collisions are improbable events that perhaps we should concern ourselves with more. It seems you're arguing that we should minimize block id allocation in order to address this, but that seems like a tiny bandaid and mostly irrelevant to the present discussion.

          The proposal I'd made was that new block ids created by appends should only be reported to the namenode when they're complete, and the namenode should not commit a file's new block id list until all block ids are reported complete, making appends atomic. Such an approach would inhibit rapid visibility of appends, which may be required, and that is a vote against it. So we may have to go with something like non-persistent block timestamps, but I still don't see your point about non-persistent timestamps substantially reducing corruption. Rather they seem to create new opportunities for corruption that may require new mechanisms to prevent such corruptions, and that may just be the price we have to pay.

          Sameer Paranjpye added a comment -

          > How is this related to append?

           The mechanism used to detect these collisions, i.e. non-persistent timestamps, can be trivially leveraged to support appends.

          > So we may have to go with something like non-persistent block timestamps, but I still don't see your point about non-persistent timestamps substantially reducing corruption.

           They enable detection of stale replicas, thereby eliminating that source of corruption.

          > Rather they seem to create new opportunities for corruption

          I don't see that. Could you explain?

          Doug Cutting added a comment -

           Ah. I think you're intending that datanodes do persist the block timestamp, but the namenode does not? If so, I'd missed that. I thought you'd said that timestamps were not persisted. But if datanodes persist them, then that could indeed help detect block id collisions, since a timestamp collision would be nearly impossible (assuming clocks are reasonably synchronized and accurate). So would DFSClient send the timestamp with each flushed buffer, and the datanode log it to a log that's replayed on startup?

          As for new opportunities for corruption, I simply meant that having multiple versions of a block increases the chances of getting the wrong version. The namenode and datanode will have substantial new logic to handle block versioning, and more logic increases the chances of faulty logic, introduces new failure modes, etc. The proposal I made required far fewer fundamental changes to block semantics, and thus mostly builds on already debugged logic. Changing blocks from immutable to mutable will require us to uncover all the places where we've assumed immutability. Some of these may not be obvious. That's all I meant. Just merrily spreading a little FUD about change of any sort!

          eric baldeschwieler added a comment -

          Why would the name node not keep the timestamp/version? Seems to me it will need it to disambiguate new block reports and such. We can afford the bytes IMO.

          Those statements about "reasonably synchronized clocks" worry me. Especially if clients or datanodes are producing them. If the name node is not the source of truth, I'm afraid we can introduce weird cases.

           I think the next logical step here is for us to outline a more detailed proposal for review. I think this is going to take a couple of distinct steps. We can outline those and get to agreement on the basics.

          Jim's request that a file exist if its producer dies before a close sounds like it would be easy, but then we get to flush semantics... I take it you don't just want the last complete 128M block Jim? If you want the last "log line" then you need something similar to what I outlined.

          Sameer Paranjpye added a comment -

          > I think you're intending that datanodes do persist the block timestamp

           Yes, that is what I was intending. Sorry if that wasn't clear in my description. Timestamps would be persisted on the Datanodes but not on the Namenode. Having the DFSClient generate the timestamps would open us up to some issues with unsynchronized clocks, though the probability of error is very low. Having the Namenode generate the timestamps would be more robust, but it might mean more Namenode transactions, depending on how frequently a timestamp is requested. It still exposes us to problems if the clock on the Namenode machine is turned back for some reason. Most robust would be the Namenode generating monotonically increasing revisions.

          The DFSClient could send the timestamp with each flushed buffer, but it's not clear that we need the timestamps to be so fine grained. A new timestamp could be generated every time a block is accessed for modification (and when an error occurs during an append). This would require no change in the number of Namenode transactions, but the corner cases when a client dies in the middle of a write could be trickier to handle.

          > As for new opportunities for corruption, I simply meant that having multiple versions of a block increases the chances of getting the wrong version

           Fair point. Block revisions will mean new code that needs extensive testing and debugging and will likely expose hidden assumptions in the code. We should get a lot of eyes on the code that we write for this issue. Given the number of watchers on this issue, that shouldn't be a problem.

          Doug Cutting added a comment -

          > sameer: [ ... ] but the corner cases when a client dies in the middle of a write could be trickier to handle.

          Yes, that worries me. I'd be more comfortable sending a new version per flush, so that we have some atomic ground to stand on. Perhaps we can address clock skew by, as you suggest, using revisions rather than timestamps. The namenode need not be involved in the allocation of each revision, only in initializing the revision counter for a modification sequence. After that, the DFSClient can increment the revision for each buffer. Do we lose anything by not using timestamps? We could avoid collisions of blockid+revision with the filesystem id, no?

           > eric: the next logical step here is for us to outline a more detailed proposal

          Yep. That'd be good. +1

          Doug Cutting added a comment -

          It just occurred to me that we already have a number that monotonically increases as we append to a block: its length. So long as we don't intend to support file modifications other than appends, this could work as the revision. It would have the advantage that, if a node has a stale copy of a block, it could incrementally update the block. Truncation might be tricky with this approach, but perhaps the current version of the final block could be, rather than the longest version, the version that matches the length of the file. Just a thought.
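
           A sketch of how replica selection could work if the block length doubled as the revision (the Replica record and helpers below are invented for illustration):

           import java.util.Comparator;
           import java.util.List;
           import java.util.Optional;

           public class LengthAsRevisionSketch {
             record Replica(String datanode, long blockId, long length) {}

             // The "current" replica of the final block is the one whose length matches
             // the file length the namenode has on record (modulo block size).
             static Optional<Replica> currentReplica(List<Replica> reported, long expectedLength) {
               return reported.stream()
                   .filter(r -> r.length() == expectedLength)
                   .findFirst();
             }

             // Shorter copies are stale, but because appends only add bytes, they could be
             // brought up to date incrementally rather than re-replicated from scratch.
             static List<Replica> staleReplicas(List<Replica> reported, long expectedLength) {
               return reported.stream()
                   .filter(r -> r.length() < expectedLength)
                   .sorted(Comparator.comparingLong(Replica::length))
                   .toList();
             }
           }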

          Jim Kellerman added a comment -

          eric baldeschwieler - 06/Sep/07 01:29 AM
          > Jim's request that a file exist if its producer dies before a close sounds like it would be easy, but then we
          > get to flush semantics... I take it you don't just want the last complete 128M block Jim? If you want the
          > last "log line" then you need something similar to what I outlined.

           Yes, we want to get back the last record we wrote (unless people think a lossy database is ok).

          Typically we write multiple records at a time. It would not be a big deal for us to follow a group of writes with a flush, since if we can't write the whole group, we'd rather not get a partial group back.

          So this means that we don't need every write to be atomic, just every flush. (Think database transaction with multiple updates: either they all succeed or they all fail)

          eric baldeschwieler added a comment -

          Yes, I'd thought of using length too. We have had requests to support truncate as well. I'm on the fence on that one. It is simple and more provably correct if we don't do truncates and just track block length.

          But we've a significant client group that has had an interest in truncates. We'll do some chewing on that one and see if we want to propose one project that does truncates and appends (less disruption to get both features) or appends only and presumably defer our ambition to do truncates for at least a year.

          The combination of very high block replication counts and network partitions still feels like it might produce some kind of nasty cases here.

          dhruba borthakur added a comment -
          Here is a slightly more detailed description of a proposal to support appending writes to files.
          There is a DataGenerationStamp associated with every block. It is persisted by the namenode and by the datanode(s).
          
          *The Writer*
          1. The client requests the namenode for a new block. The namenode generates a new blockid and associates a DataGenerationStamp 
              of 0 with this block. It persists this blockId in the inode and the DataGenerationStamp in the BlocksMap. The Namenode 
              returns the blockid, DataGenerationStamp and block locations to the Client.
          2. The Client sends the blockid and DataGenerationStamp to all the datanodes in the pipeline. The Datanodes record the blockid and the
              DataGenerationStamp persistently and return. In case of error, go to Step 3a.
          3. The Client then starts streaming data to the Datanodes in the pipeline.
              The Client notices if any datanodes in the pipeline encountered an error. In this case:
              3a. The Client removes the bad datanode from the pipeline.
              3b. The Client requests a new DataGenerationStamp for this block from the NameNode. The Client also informs the Namenode 
                    of the bad Datanode.
              3c. The Namenode removes the bad datanode as a valid block location for this block. The Namenode increments the current 
                    DataGenerationStamp by one, persists it, and returns it to the Client.
              3d. The Client sends the new DataGenerationStamp to all remaining datanodes in the pipeline.
              3e. The Datanodes receive the new DataGenerationStamp and persist it.
              3f. The Client can now continue, go back to Step 3 above.
          4. The Datanode sends block confirmations to the namenode when the full block is received. The block confirmation has the 
              blockid and DataGenerationStamp in it.
          5. The Namenode receives a block confirmation from a Datanode. If the DataGenerationStamp does not match with what is stored in 
              the inode, the namenode refuses to consider that Datanode as a valid replica location. The namenode sends a block delete 
              command to that Datanode.
          
          *Reader (concurrent reading while file is being appended to)*
          1.  A reader that opens a file gets the list of blocks from the Namenode. Each block has the block locations 
              and DataGenerationStamp too.
          2.  A client sends the DataGenerationStamp along with every read request to a datanode. The datanode refuses to serve the 
               data if the DataGenerationStamp does not match with the value in its persistent store. In this case, the client will fail 
               over to other datanodes.
          
          This algorithm came out of a discussion with Sameer. This solution does not solve the problem of duplicate blockids 
          that can result when datanodes that were down for a long time re-appear.
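
          To make the stamp bookkeeping concrete, here is a minimal, self-contained sketch of a datanode-side table that records the DataGenerationStamp (steps 2 and 3e) and rejects reads carrying a stale stamp (Reader step 2). The class and method names are illustrative, not the actual HDFS code:

          import java.io.IOException;
          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;

          // Illustrative only: per-block generation stamps as described in the proposal.
          class StampedBlockStore {
            private final Map<Long, Long> stampByBlockId = new ConcurrentHashMap<Long, Long>();

            // Steps 2 and 3e: persist the (new) stamp for a block before accepting data.
            void recordStamp(long blockId, long stamp) {
              stampByBlockId.put(blockId, stamp);
            }

            // Reader step 2: refuse to serve data if the client's stamp does not match.
            void checkRead(long blockId, long clientStamp) throws IOException {
              Long current = stampByBlockId.get(blockId);
              if (current == null || current.longValue() != clientStamp) {
                throw new IOException("Stale DataGenerationStamp for block " + blockId
                    + ": client has " + clientStamp + ", datanode has " + current);
              }
            }
          }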
          
          
          Doug Cutting added a comment -

          When is the block data transmitted? Does a datanode not persist its stamp for a block until after it has persisted the data corresponding to that stamp? That way, if it crashes before it receives all of the data, it has an out-of-date stamp. Is that right?

          What happens when a block is partially modified and the client crashes?

          dhruba borthakur added a comment -

          Block data is transmitted in Step 3 above. A datanode persists the DataGenerationStamp as soon as it receives it, and then starts receiving the data. If it crashes before it has received all the data, the Client detects this condition and increments the DataGenerationStamp on the namenode and on the remaining good datanodes.

          Let's consider the case when the client crashes before the current write (or flush) is successfully transmitted to all datanodes. In this case it is possible that the Datanodes have different sizes for this block. The lease then expires on the Namenode, and it is the namenode's duty to do the recovery that the client would otherwise have done. The namenode fetches from each Datanode the size of its replica of the block-under-modification and selects the largest size as valid. It increments the DataGenerationStamp and sends this new stamp to the datanodes that have the largest-size replica.

          What happens if the namenode crashes before it could complete the entire lease-timeout-triggered-recovery (described above) for blocks that were being modified? I am thinking about this one.
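
          A minimal sketch of the recovery decision described above: from the replica lengths reported by the datanodes, keep only the replicas with the largest length; those are the ones that would receive the incremented DataGenerationStamp. The types and names are illustrative, not the namenode code:

          import java.util.ArrayList;
          import java.util.List;
          import java.util.Map;

          // Illustrative lease-recovery step, not the actual namenode logic.
          class LeaseRecoverySketch {
            static List<String> selectSurvivors(Map<String, Long> replicaLengthByDatanode) {
              long largest = 0;
              for (long length : replicaLengthByDatanode.values()) {
                largest = Math.max(largest, length);
              }
              List<String> survivors = new ArrayList<String>();
              for (Map.Entry<String, Long> entry : replicaLengthByDatanode.entrySet()) {
                if (entry.getValue().longValue() == largest) {   // keep only the largest replicas
                  survivors.add(entry.getKey());
                }
              }
              return survivors;   // these datanodes get the new, incremented stamp
            }
          }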

          Doug Cutting added a comment -

          This sounds hairy to me, with lots of edge conditions. And we have not yet specified how truncate will work. If the point of using a DataGenerationStamp instead of simply block length is to support truncates too, then we should design and implement truncate too, else how do we know that the design really is sufficient for truncate?

          At this point I'm leaning towards only supporting append and using block length. The semantics are much simpler. Eric?

          Raghu Angadi added a comment -

          Is there a description of what happens when a client wants to append data to a file?

          eric baldeschwieler added a comment -

          I think we can get away without supporting truncate in the medium term. Less is more...
          That said, I don't have a strong opinion on generation numbers vs length.

          I'd like to be sure that we are able to detect the stale block issue. Maybe a unique file ID?

          Does this proposal try to finish with fewer datanodes than requested or does it request new ones?

          Sameer Paranjpye added a comment -

          Generation numbers would be needed even if we don't support truncate. Using length is not reliable, there are a few corner cases where data can be silently lost. In addition, the proposal described above isn't completely consistent with our discussions. We'll need to publish something a lot more detailed with explanations of how corner cases are handled.

          dhruba borthakur added a comment -

          The above proposal says that if a datanode in the pipeline dies, the client detects this condition but does not request a new datanode as a replacement. If this does not give us an acceptable level of reliability, we might have to design something better.

          dhruba borthakur added a comment -

          Here is a more detailed description of the proposal to support appends in HDFS. This is just the first draft and I will be expanding more on it next week. Please review and let me know your feedback.

          dhruba borthakur added a comment -

          A Word document that has the same contents as the html design document

          Doug Cutting added a comment -

          That html was pretty unreadable, at least in firefox. I re-generated from the .doc using Open Office.

          Doug Cutting added a comment -

          "The Client notices if any Datanodes in the pipeline encountered an error [... ]".

          We need a bit more discussion of how this will work. The typical error will not be a nice message sent back up the pipeline, but rather will be a timeout, breaking the pipeline. We must be careful so that timeouts do not cascade, perhaps by decreasing the timeout for each stage added to the pipeline. Removing a node from the pipeline then means building a new pipeline from at least the node before the fracture onwards. Is this the sort of mechanism you have in mind?
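
          A toy illustration of the "decreasing timeout per stage" idea (all numbers arbitrary): each stage further down the pipeline waits a little less than its upstream neighbour, so the node nearest the failure times out first and can report the error upstream instead of everyone timing out at once.

          // Toy calculation only; real timeouts would come from configuration.
          class PipelineTimeoutSketch {
            static int[] stageTimeoutsMillis(int pipelineLength, int baseTimeoutMillis, int perStageDeltaMillis) {
              int[] timeouts = new int[pipelineLength];
              for (int stage = 0; stage < pipelineLength; stage++) {
                // stage 0 is closest to the client; deeper stages wait less
                timeouts[stage] = baseTimeoutMillis - stage * perStageDeltaMillis;
              }
              return timeouts;
            }

            public static void main(String[] args) {
              int[] t = stageTimeoutsMillis(3, 60000, 5000);   // e.g. 60s, 55s, 50s
              for (int stage = 0; stage < t.length; stage++) {
                System.out.println("stage " + stage + ": " + t[stage] + " ms");
              }
            }
          }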

          dhruba borthakur added a comment -

          This version has a description of how appends will work with the BlockRebalancer and Datanode periodic verification. It also has details on how a client-flush call will handle data that is smaller than a crc chunk.

          Jim Kellerman added a comment -

          Comments on the document.

          In the Requirements section:

          > There can be many simultaneous writers reading the same file at the same time.

          Do you mean readers here instead of writers?

          As far as I understand the internals of the DFS, this appears to satisfy our requirements.
          I think the section entitled "File Deletions and Renames" addresses our needs, but let me be sure by asking the following:

          A process is writing to a file and has done flushes. Another process detects that the writer is now dead and reads the file (up to the point it was last flushed) and then deletes the file (because it knows the writer is dead). This happens well before the file lease timeout. My understanding is that the file will be deleted and that if another process tries to create a file with the same name, this should succeed and not contain any of the contents of the original file. Is this correct?

          If so, ++1

          Jim Kellerman added a comment -

          One more comment. As I am not entirely familiar with the internals of DFS, I'd like to see a +1 from Doug (Cutting) and/or Konstantin.

          dhruba borthakur added a comment -

          Thanks for your comments. Your understanding is absolutely correct. I made the use-case explicit in the section titled "File Deletions and Renames". Thanks.

          dhruba borthakur added a comment -

          An html version of this design document.

          Doug Cutting added a comment -

          I think this design looks well-enough thought through that we should probably start trying to implement it.

          This design document is great to have! Can you please convert it to forrest XML and add it to a reference section of the documentation tree along with the patch for this issue?

          dhruba borthakur added a comment -

          Incorporated review comments from Rob Chansler. Thanks Rob.

          Ruyue Ma added a comment -

          Hi, everyone

          We are making our best effort to improve the Hadoop Distributed Filesystem (HDFS). Our current plan is to implement single-client append and truncate based on the hadoop 0.15.0 version.

          Our implementation will follow the design of HADOOP-1700. In the latest version of HADOOP-1700, we found some points (or bugs) to be corrected/improved. Here are the points:

          0. We recommend that the BlockID not be a random 64-bit ID. Instead we keep a global variable NextBlockID: a 64-bit counter, initialized to 0 and recorded in the transaction log. When we need a new BlockID, we hand out NextBlockID as the new BlockID and increment NextBlockID by 1. The new NextBlockID is recorded in the transaction log. (A small sketch of this allocation follows at the end of this comment.)

          1. In the section "The Writer" of HADOOP-1700, the original text is:

          ____________________________________________________________________
          ..........
          The Writer requests the Namenode to create a new file or open an existing file with an intention of appending to it. The Namenode generates a new blockId and a new GenerationStamp for this block. Let's call the GenerationStamp that is associated with a block as BlockGenerationStamp. A new BlockGenerationStamp is generated by incrementing the global GenerationStamp by one and storing the global GenerationStamp back into the transaction log. It records the blockId, block locations and the BlockGenerationStamp in the BlocksMap. The Namenode returns the blockId, BlockGenerationStamp and block locations to the Client.
          .........
          ____________________________________________________________________
          our comment:
          The Writer requests the Namenode to create a new file or open an existing file with an intention of appending to it. The Namenode generates a new blockId and a new GenerationStamp for this block. Let's call the GenerationStamp that is associated with a block a BlockGenerationStamp. A new BlockGenerationStamp is generated by incrementing the global GenerationStamp by one and storing the global GenerationStamp back into the transaction log. ____If the block is a new block, the Namenode returns the blockID, BlockGenerationStamp and block locations to the client, but does not record the blockID, BlockLocations and BlockGenerationStamp in the BlocksMap. If the block is an existing block (the last block is not full), we generate a new BlockGenerationStamp and return the BlockID, BlockLocations, old BlockGenerationStamp and new BlockGenerationStamp to the client; the namenode does not record the new BlockGenerationStamp.____

          When the client has successfully updated all the datanodes where the replicas of the block are located, it tells the namenode that the update succeeded, along with the information (BlockID, BlockLocations, new BlockGenerationStamp). The Namenode then records the blockId, block locations and the new BlockGenerationStamp in the BlocksMap, and writes the OpenFile transaction log record.
          ____________________________________________________________________

          • Note: our method can tolerate the following failure. If the namenode recorded the new BlockGenerationStamp in the BlocksMap up front, and the client then failed to update the datanodes or the client (writer) crashed, the namenode would start lease recovery. At that moment the replica version may be smaller than the version recorded on the namenode, so the namenode would discard the replicas, which is not the result we want. Our method tolerates this failure.

          2. In HADOOP-1700, the OpenFile transaction log record contains the whole block list. If the client invokes the flush operation frequently, the overhead on the namenode may be very heavy. So we adopt the following method: we only record the block that is changing, and do not record the blocks that are not being modified. If a new block is created, we record its blockID and BlockGenerationStamp. If the BlockGenerationStamp of a block is modified, we only record the new BlockGenerationStamp for that block. This reduces the overhead on the namenode when it writes the OpenFile transaction log record.

          We look forward to getting help from the Hadoop community. Any advice will be appreciated.
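
          A minimal sketch of the sequential block-id allocation described in point 0 above; the EditLog interface is a stand-in for the transaction log, not an actual namenode API:

          // Illustrative sketch of the sequential NextBlockID described in point 0.
          class SequentialBlockIdAllocator {
            interface EditLog {                       // stand-in for the real transaction log
              void logNextBlockId(long nextBlockId);
            }

            private final EditLog editLog;
            private long nextBlockId;                 // recovered from the log on restart

            SequentialBlockIdAllocator(EditLog editLog, long recoveredNextBlockId) {
              this.editLog = editLog;
              this.nextBlockId = recoveredNextBlockId;
            }

            synchronized long allocateBlockId() {
              long id = nextBlockId;                  // hand out the current value
              nextBlockId = id + 1;                   // advance the counter
              editLog.logNextBlockId(nextBlockId);    // record the new NextBlockID
              return id;
            }
          }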

          dhruba borthakur added a comment -

          Thanks Ruyue for your comments.

          0. Regarding your comment about using a sequential block-id, instead of a random block-id: How do we upgrade existing clusters? An existing cluster can have a huge number of already allocated blocks.

          1. Your comment about Point 1 makes sense. I will update the document.

          2. Regarding Point 2: a file typically has few blocks, ranging from somewhere between 2 to 10 blocks. I like your proposal to optimize transaction logging, especially when the number of blocks in a file is huge. Is it possible to consider it as an enhancement and implement it after implementing HDFS Appends? Won't the system be simpler if we avoid this optimization in the first go, and then work on it once HDFS Append is committed?

          Ruyue Ma added a comment -

          This patch gives a simple append implementation based on the 0.15.1 version. It is completely compatible with 0.15.1. It adds a new open() function to the DFSClient class.

          Usage:

          1. open for read

          InputStream is = (FSDataInputStream)dfsclient.open(path, buffersize, "r");

          2. open for append

          OutputStream os = (FSDataOutputStream)dfsclient.open(path, buffersize, "a");

          3. open for append (if no file, create it)

          OutputStream os = (FSDataOutputStream)dfsclient.open(path, buffersize, "a+c");

          Ruyue Ma added a comment -

          I have written a new HDFS VFS implementation via fuse-j. This VFS is based on our append implementation.
          If anyone needs it, please send mail to me. My mail is ruyue.ma@gmail.com
          The VFS implementation works correctly.

          Ruyue Ma added a comment -

          Our append.patch is only a simple one. The goal is to stay compatible with 0.15.1.

          • It doesn't support fine-grained client flush.
          • I use a sequential block-id, instead of a random block-id, so no blockGenerationStamp is added.
          Ruyue Ma added a comment -

          Our next patch will give a complete append implementation according to HADOOP-1700.

          stack added a comment -

          Ruyue:

          I know little of hdfs internals but here are a few comments on the patch.

          Looks like you made your patch with diff going between a java-old and a java-new directory. Can you make it instead by doing a 'svn diff' inside the top directory of a hadoop checkout? You should also make the patch apply to TRUNK. Currently it does not. It's unlikely that your patch will be added to the 0.15.x branch since usually only bug fixes are made against released versions and, as is, it makes it hard for me to take your patch for a test drive.

          Tabs should be purged and replaced by spaces to make your patch palatable.

          New classes could do with javadoc explaining their purpose: e.g. BasicAppendInfo class comment could explain where its used and the payload it carries.

          New methods added to the protocol could also do with javadoc: e.g. baseAppendFile. Also, I'd guess that, as is, your patch will emit javadoc warnings, which will cause it to fail the patch build (your createFile method does not document its parameters). You also have an instance of empty javadoc.

          Your getter for blockid is synchronized but the setter is not.

          dhruba borthakur added a comment -

          Hi Ruyue,

          I like all of the preceding review comments by stack. I have two basic questions.

          1. Suppose a client has opened a file for "append" and is writing to the end of the file. The last block of the file has three replicas, and it can happen that the system fails to write the data to some replicas. What is your error recovery policy? All replicas would be of different sizes; can you please point me to the code that handles this inconsistency? I cannot find this code in your patch.

          2. The DataNode, while finalizing a block, sends the block-size to the NameNode. In the case of "appends", the block size of the block changes with every write. Can you please point me to the code that makes the necessary changes to the Namenode's data structures so that they reflect the new size after the append is finished? I could not find this code in your patch.

          thanks,
          dhruba

          Chad Walters added a comment -

          What's the status on this patch? Is there anything that can be done to move this along? This is long-awaited functionality on many fronts. There was also some mention by Ruyue of a more complete patch coming soon. Would be willing to help out in any way possible.

          dhruba borthakur added a comment -

          Hello Chad, I am currently working on HADOOP-2345. I created additional JIRAs (HADOOP-2655, HADOOP-2656) that are pre-requisites for this one. Would appreciate any help in addressing these JIRAs.

          dhruba borthakur added a comment -

          I am canceling this patch because I am not convinced that the current patch does error recovery.

          stack added a comment -

          Was wondering if there is any updated status on this patch. We're having an HBase User Group meeting this evening and I'm sure there'll be an inquiry as to the state of this issue. Thanks.

          dhruba borthakur added a comment -

          Most of the JIRAs linked to this one are fixed or have patches associated with them. That means most of the pre-requisites for supporting appends are already done. I am currently working on a distributed upgrade to move a current cluster to the new disk format (HADOOP-2656).

          Once the above pre-requisites are done, the new code for supporting Appends will be started. I am guessing it will be another 1.5 man-months before it appears as a draft patch.

          Doug Cutting added a comment -

          Should we add another blocking issue to extend the FileSystem API to support append? This might just be adding a new method, 'FSDataOutputStream append(Path)' that throws UnsupportedOperationException by default. We could implement it in the local filesystem then too.

          This might be useful to applications (like HADOOP-3245) that might want to start writing append-aware code before HDFS append is available. Thoughts?
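
          Sketched as a standalone abstract class rather than the real org.apache.hadoop.fs.FileSystem, to show the shape of such a default; the names and the plain OutputStream return type are placeholders:

          import java.io.IOException;
          import java.io.OutputStream;

          // Placeholder class, not the actual FileSystem API.
          abstract class AppendableFileSystemSketch {
            public abstract OutputStream create(String path) throws IOException;

            // Default: file systems that do not support append reject the call;
            // HDFS and the local file system would override this once ready.
            public OutputStream append(String path) throws IOException {
              throw new UnsupportedOperationException("Append is not supported by "
                  + getClass().getSimpleName());
            }
          }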

          dhruba borthakur added a comment -

          Creating a new JIRA to decide on the Append API sounds great. I will create a JIRA and link it with H-1700.

          Doug Cutting added a comment -

          HDFS's cache of FileStatus in each Path (HADOOP-2565) should be removed before append is implemented.

          Sameer Paranjpye added a comment -

          Dhruba, need some guidance here. Do you intend to complete work on append? Are there any append related patches ready for review or commit that you need help with?

          Robert Chansler added a comment -

          Attached a record of some of the discussions of how to begin using sequential block IDs.
          Grid_HadoopRenumberBlocks.pdf

          dhruba borthakur added a comment -

          Hi Sameer, I would like to complete the "append" work as described in the design paper associated with this JIRA. To achieve this, I would like to get HADOOP-2656 committed. This JIRA adds a generation stamp to the Block object and is a blocker for all succeeding append-related JIRAs.

          Then, Nicholas's HADOOP-3310 should be committed followed by HADOOP-3113 that I am currently working on.

          Tsz Wo Nicholas Sze added a comment -

          > To achieve this, I would like to get HADOOP-2656 committed.

          I think HADOOP-2656 is ready to be committed.

          Sameer Paranjpye added a comment -

          > Hi Sameer, I would like to complete the "append" work as described in the design paper associated with this JIRA

          Awesome! Thanks Dhruba. Nicholas will work with you to take this forward.

          Robert Chansler added a comment -

          The current generation stamp implementation should be committed. I recorded the discussion of serial block IDs for the curious, but I don't propose changing directions. HADOOP-2656 should go!

          Tsz Wo Nicholas Sze added a comment - - edited

          Append is almost ready.

          1700_20080606.patch: implemented DistributedFileSystem.append(...).

          TODO:

          • trigger lease recovery when a lease already exists and is expired
          • initialize write to the end of a file
          dhruba borthakur added a comment -

          The patch looks good. This patch introduces the changes to the ClientProtocol. More changes will be required for the data transfer protocol between the client and datanode to implement the full functionality. I documented one of the major remaining issues in HADOOP-3515.

          Till we develop a patch for HADOOP-3515, we can either hold off on this patch or check it in with appendFileInternal throwing an unsupported-operation exception. I vote for creating a new JIRA to check in this patch but leave HADOOP-1700 open till we get a complete solution for appends.

          Tsz Wo Nicholas Sze added a comment -

          So HADOOP-3515 takes care of "initialize write to the end of a file".

          For "trigger lease recovery when a lease already exists and is expired", I think the NameNode could throw an exception when the lease's soft limit has expired; the client can then go ahead and trigger the lease recovery.

          The patch also needs some tests.
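
          A rough sketch of the control flow suggested above; LeaseExpiredSoftLimitException and the NamenodeStub methods are placeholders for illustration, not actual HDFS names:

          import java.io.IOException;

          // Illustrative control flow only.
          class AppendLeaseRecoverySketch {
            static class LeaseExpiredSoftLimitException extends IOException {
              private static final long serialVersionUID = 1L;
            }

            interface NamenodeStub {
              void startAppend(String path) throws IOException;
              void recoverLease(String path) throws IOException;
            }

            static void appendWithRecovery(NamenodeStub namenode, String path) throws IOException {
              try {
                namenode.startAppend(path);
              } catch (LeaseExpiredSoftLimitException e) {
                // The previous writer's soft limit has expired: reclaim the lease, then retry once.
                namenode.recoverLease(path);
                namenode.startAppend(path);
              }
            }
          }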

          dhruba borthakur added a comment -

          Finally, the last patch for supporting "append" on HDFS. This patch supports a single simultaneous appender to a file.

          I purposely left in a few debug prints for now. This version of the patch is now available for review.

          dhruba borthakur added a comment -

          This is the final patch for supporting appends. It has a unit test that randomly triggers random-size appends to files of random size with random replication factors.

          Tsz Wo Nicholas Sze added a comment -

          appendtrunk6.patch does not apply anymore. Could you update it, Dhruba?

          dhruba borthakur added a comment -

          Merged patch with latest trunk.

          Tsz Wo Nicholas Sze added a comment -

          I was not able to review the whole patch. Below is what I found so far. Updated to appendtrunk8.patch.

          Major changes:

          • Checking permission for append in FSNamesystem
          • Surrounded the stream-reading code with a try-finally in computePartialChunkCrc(...)
          • Made BlockInputStreams implement Closeable
          • Re-arranged the DFSOutputStream constructors to reduce code duplication
          • Re-arranged some checksum code to reduce code duplication

          Minor changes:

          • throwing IOException("Not supported") in SimulatedFSDataset.getTmpInputStreams(...)
          • Renamed BlockReadStreams to BlockInputStreams.
          • added StringUtils.byteToHexString(byte[] bytes, int start, int end)
          • added javadoc to DFSClient.append(...)
          • Checking LOG.isDebugEnabled() before calling LOG.debug(...)
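
          For reference, the LOG.isDebugEnabled() guard mentioned above is the usual pattern of skipping message construction when debug logging is off; a small commons-logging sketch (the class and message are illustrative):

          import org.apache.commons.logging.Log;
          import org.apache.commons.logging.LogFactory;

          class DebugGuardSketch {
            private static final Log LOG = LogFactory.getLog(DebugGuardSketch.class);

            static void reportProgress(long blockId, long bytesWritten) {
              if (LOG.isDebugEnabled()) {
                // The string below is only built when debug logging is actually enabled.
                LOG.debug("block " + blockId + ": wrote " + bytesWritten + " bytes");
              }
            }
          }
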
          Tsz Wo Nicholas Sze added a comment -

          appendtrunk9.patch: added two more tests for testing appending to a non-existing file and permission checking

          However, TestFileAppend2 currently fails, even for trunk + appendtrunk7.patch. Could you check it, Dhruba?

          dhruba borthakur added a comment -

          Thanks for the tests and associated code-changes Nicholas. I will look into why the unit test failed on your setup.

          Tsz Wo Nicholas Sze added a comment -

          appendtrunk10.patch: fixed a bug. Passed TestFileAppend2 on Linux but there is a rename problem in Windows.

          Tsz Wo Nicholas Sze added a comment -

          appendtrunk11.patch: fixed the rename problem on Windows. The only test that fails is TestDatanodeDeath. It also fails on trunk + appendtrunk7.patch.

          Tsz Wo Nicholas Sze added a comment -

          appendtrunk12.patch: removed some debug code

          dhruba borthakur added a comment -

          This fixes the test failure with TestDatanodeDeath.

          dhruba borthakur added a comment -

          Added TestFileAppend2 unit test.

          dhruba borthakur added a comment -

          Fixed javac and javadoc warnings. This probably has two Findbugs warnings still to be fixed.

          dhruba borthakur added a comment -

          Incorporated most review comments. Let's try HadoopQA tests.

          dhruba borthakur added a comment -

          Fixed findbugs warnings. Many thanks to Nicholas for helping me run findbugs.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12386249/appendtrunk14.patch
          against trunk revision 677470.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2888/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2888/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2888/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2888/console

          This message is automatically generated.

          dhruba borthakur added a comment -

          Ok, got +1 from hadoopQA. I am planning to check this into trunk by the end of this week!

          Konstantin Shvachko added a comment -

          This is a large patch, and we don't have a test plan or any evidence that the feature was thoroughly tested.
          The unit test includes two test cases:

          1. Create one file, close, reopen and append.
          2. Run many appends in parallel to many files.

          What really needs to be tested are the different failure scenarios, such as when the client or one of the data-nodes in the pipeline fails
          at different stages of the transaction, and how the name-node reacts to this. Or what happens if the name-node fails in the middle
          of appends and is restarted, etc. As we have seen with the lease recovery feature, all these things can cause serious problems
          and should be tested in detail.
          So, as stated in HADOOP-2658, a test plan should be both designed and implemented before such a big feature can be committed.
          In my opinion it should include a lot more unit tests.

          Briefly looked at the code and noticed that you forgot to increment ClientDatanodeProtocol.versionID, although the comment is there.
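
          For reference, the first scenario listed above boils down to something like the following minimal sketch against the new FileSystem.append(Path) API; the configuration, path, and data here are illustrative assumptions rather than the actual TestFileAppend2 code:

          {code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // picks up the cluster's *-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path p = new Path("/tmp/appendSketch.dat");      // illustrative path
    try {
      // 1. Create a file, write a little data, and close it.
      FSDataOutputStream out = fs.create(p);
      out.write(new byte[] {1, 2, 3});
      out.close();

      // 2. Reopen the same file for append and write some more.
      out = fs.append(p);
      out.write(new byte[] {4, 5, 6});
      out.close();

      // The file should now be 6 bytes long.
      System.out.println("length = " + fs.getFileStatus(p).getLen());
    } finally {
      fs.close();
    }
  }
}
          {code}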

          Robert Chansler added a comment -

          This is a substantial new feature, and 0.18 is still waiting on the fixes to previous changes needed to facilitate append.

          Konstantin's previous comment has some useful detail about the necessary unit tests. There is also a need for at-scale functional testing.

          Has there been any functional testing from application programs? At challenging load levels? In competition with other activity?

          Has performance testing established that append changes do not have an adverse impact on other operations?

          Does the design documentation need to be updated since November?

          What about user documentation?

          Doug Cutting added a comment -

          As a new feature, the testing requirements are somewhat lower, no? Existing tests should test existing functionality, so we should have some confidence that this won't break existing applications. If new applications that use the new feature are broken then that's less of a problem. So we ought to perhaps do our normal stress-testing (e.g., sort900) before committing this, to ensure that existing features still work well.

          Things are actually a bit more complex, since, e.g., a new application that uses this new feature might crash a namenode that's shared with an existing application. One way to address this concern might be to permit the new feature (appends) to be disabled.

          That said, I think more thorough tests would be good to have, I just question whether they need to hold up committing this. Getting features into trunk earlier, well before a freeze, is also a good way to find problems, problems that might not be found with synthetic tests.

          So, here is my concrete proposal. Before we commit this, we should:

          • make sure that large-scale tests still run well
          • add a flag to disable appends, in case they introduce namenode or datanode instabilities
          • add a 0.19 blocker that adds more append tests
          • add a 0.19 blocker that adds better append documentation

          Does that sound reasonable?

          Nigel Daley added a comment -

          Dhruba, wasn't there a test plan for this at one point? I can't seem to find it.

          Doug's comments seem reasonable if someone is going to sign up to fix those blockers.

          In addition to Doug's list, I'd add that a test plan must be published for this feature.

          dhruba borthakur added a comment -

          I agree with Doug. Also, this patch is pretty light-weight compared to earlier append-related patches. Most of the heavy-lifting has been done in earlier patches that are part of 0.18. This patch does not change any disk format(s).

          Making this configurable is a good option; I will do it as part of this JIRA.

          Regarding the test plan, there is a JIRA opened in Jan 08: http://issues.apache.org/jira/browse/HADOOP-2658. Also, large-scale functional and performance testing would have to be done at a site that has plenty of machines. I would request Nigel/Mukund to take an initial stab at these two items.

          Opened a new JIRA to add more append-related tests: http://issues.apache.org/jira/browse/HADOOP-3790. I have assigned it to myself.

          dhruba borthakur added a comment -

          Introduce a configuration option to enable/disable HDFS appends.

          Robert Chansler added a comment -

          The answer to Doug's question is, "Not necessarily." But I'm in general agreement with his proposal, with the understanding that the new blockers are assigned in advance.

          dhruba borthakur added a comment -

          If dfs.support.append is set to false, then the "append" feature of HDFS is switched off.
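
          A minimal sketch of how an application sees this switch, assuming the flag lives in the cluster's configuration files and that a cluster running with it disabled rejects the request with an IOException; the path, data, and fallback value are illustrative:

          {code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendSwitchSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // The cluster enables or disables the feature via dfs.support.append
    // in its configuration; the fallback value used here is arbitrary.
    System.out.println("dfs.support.append = "
        + conf.getBoolean("dfs.support.append", false));

    FileSystem fs = FileSystem.get(conf);
    try {
      FSDataOutputStream out = fs.append(new Path("/tmp/app.log")); // illustrative path
      out.write("one more record\n".getBytes("UTF-8"));
      out.close();
    } catch (IOException e) {
      // On a cluster running with dfs.support.append=false, the append
      // request is rejected and surfaces as an IOException here.
      System.err.println("appends are disabled: " + e.getMessage());
    } finally {
      fs.close();
    }
  }
}
          {code}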

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12386453/appendtrunk15.patch
          against trunk revision 678080.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2913/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2913/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2913/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2913/console

          This message is automatically generated.

          dhruba borthakur added a comment -

          Hi Nigel/Mukund,

          I understand that large-scale performance testing and a test plan are the two things keeping this issue from being checked in. I would appreciate it a lot if either of you could help in this regard. Thanks in advance,

          dhruba

          Konstantin Shvachko added a comment -

          As far as I understood the latest discussion, the prerequisites for this patch are Doug's proposal plus the test plan.
          I'll repeat them here.

          • make sure that large-scale tests still run well
          • add a flag to disable appends, in case they introduce namenode or datanode instabilities (done)
          • add a 0.19 blocker that adds more append tests
          • add a 0.19 blocker that adds better append documentation
          • Develop and implement test plan (HADOOP-2658)

          The biggest issues are the large-scale test and the test plan.
          I think if we implement the test plan, the large-scale test will go much more smoothly.
          How large is large enough? A hundred would do it, in my opinion.

          Doug Cutting added a comment -

          > Develop and implement test plan (HADOOP-2658)

          I thought we could commit 1700 before we implement all the tests, provided the tests were a 0.19 blocker and folks were committed to implementing them ASAP.

          Sameer Paranjpye added a comment -

          +1

          Dhruba, if you can author a test plan in HADOOP-2658, we can work with you to implement and run the tests at scale.

          dhruba borthakur added a comment -

          Ok, I will author a test plan and post it to HADOOP-2658.

          dhruba borthakur added a comment -

          I had an offline discussion with Nicholas, Sameer, Rob Chansler and Nigel. We agreed to open a couple of JIRAs (marked as blockers for 0.19) covering documentation, writing more unit tests, creating an append benchmark, and running large-scale tests to validate existing behaviour and performance. Rob has volunteered to create these JIRAs. Once they are created, we should check in this patch.

          This means that this patch will likely be checked into trunk in a day or two.

          Tsz Wo Nicholas Sze added a comment -

          +1 appendtrunk15.patch looks good

          Konstantin Shvachko added a comment -

          ClientDatanodeProtocol version is still 1. Should be 2.
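
          For context, a hypothetical fragment of what that bump looks like: Hadoop RPC interfaces extend VersionedProtocol and advertise a static versionID that has to be incremented whenever the protocol changes. The field is real; the fragment and its comments below are illustrative, not the actual source:

          {code}
import org.apache.hadoop.ipc.VersionedProtocol;

// Illustrative fragment only; the real interface lives in the HDFS sources.
public interface ClientDatanodeProtocol extends VersionedProtocol {
  // 1: initial version
  // 2: changed as part of HDFS append (the bump requested above)
  public static final long versionID = 2L;
}
          {code}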

          dhruba borthakur added a comment -

          Incorporated Konstantin's review comments.

          dhruba borthakur added a comment -

          Rerun HadoopQA one final time!

          dhruba borthakur added a comment -

          Rerun HadoopQA one final time!

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12386795/appendtrunk16.patch
          against trunk revision 679601.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2944/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2944/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2944/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2944/console

          This message is automatically generated.

          dhruba borthakur added a comment -

          Appends to HDFS are finally committed! Many thanks to the many people who contributed to the design, development and testing. Special thanks to Nicholas, who developed some key pieces of this patch!

          dhruba borthakur added a comment -

          I just committed this!

          stack added a comment -

          Whoopdee!

          Hudson added a comment -

          Integrated in Hadoop-trunk #581 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/581/ )

            People

            • Assignee: dhruba borthakur
            • Reporter: stack
            • Votes: 11
            • Watchers: 31
