Hadoop HDFS / HDFS-1057

Concurrent readers hit ChecksumExceptions if following a writer to very end of file

    Details

    • Hadoop Flags:
      Reviewed

      Description

      In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before calling flush(). Therefore, if there is a concurrent reader, it's possible to race here - the reader will see the new length while those bytes are still in the buffers of BlockReceiver. Thus the client will potentially see checksum errors or EOFs. Additionally, the last checksum chunk of the file is made accessible to readers even though it is not stable.
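
      A minimal sketch of the problematic ordering (editor's illustration in plain Java; the field and stream names are stand-ins, not the actual BlockReceiver members):

        import java.io.BufferedOutputStream;
        import java.io.FileOutputStream;
        import java.io.IOException;

        // Editor's illustration of the race described above, not the real BlockReceiver code.
        class ReceiverRaceSketch {
          private volatile long bytesOnDisk;            // length a concurrent reader consults
          private final BufferedOutputStream dataOut;

          ReceiverRaceSketch(String path) throws IOException {
            this.dataOut = new BufferedOutputStream(new FileOutputStream(path));
          }

          // Problematic ordering: the length is advertised before the bytes leave the
          // stream's buffer, so a reader trusting bytesOnDisk can hit EOF or a checksum
          // mismatch on the tail of the block.
          void receivePacket(byte[] packet, long offsetInBlock) throws IOException {
            bytesOnDisk = offsetInBlock;   // a reader already sees the new length here
            dataOut.write(packet);         // ...while the data is still only buffered
            dataOut.flush();               // the data becomes readable from disk only now
          }

          long getBytesOnDisk() { return bytesOnDisk; }
        }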

      Attachments

      1. conurrent-reader-patch-1.txt
        30 kB
        sam rash
      2. conurrent-reader-patch-2.txt
        31 kB
        sam rash
      3. conurrent-reader-patch-3.txt
        34 kB
        sam rash
      4. HDFS-1057.20-security.1.patch
        34 kB
        Jitendra Nath Pandey
      5. HDFS-1057-0.20-append.patch
        35 kB
        Nicolas Spiegelberg
      6. hdfs-1057-trunk-1.txt
        27 kB
        sam rash
      7. hdfs-1057-trunk-2.txt
        26 kB
        sam rash
      8. hdfs-1057-trunk-3.txt
        25 kB
        sam rash
      9. hdfs-1057-trunk-4.txt
        29 kB
        sam rash
      10. hdfs-1057-trunk-5.txt
        30 kB
        sam rash
      11. hdfs-1057-trunk-6.txt
        29 kB
        sam rash

        Issue Links

          HDFS-1060 Append/flush should support concurrent "tailer" use case

          Activity

          Todd Lipcon added a comment -

          This problem is worse than originally reported. Switching the order of flush and setBytesOnDisk doesn't solve the problem, because the last checksum in the meta file is still changing. So, since we don't access the data synchronously with the checksum, a client trying to read the last several bytes of a file under construction will get checksum errors.

          Solving this is likely to be very tricky...

          sam rash added a comment -

          Hi,

          I'm working on getting this functionality into an internal 20-based rev as well. Here are some solutions I've thought of (and tried 1 & 2) in case it helps discussion towards a good sol'n:

          solutions (some tried, some theories):
          1. truncate in-progress blocks to chunk boundaries. This solved the problem, but fails because technically sync'd data is not available (the partial chunk at the end)
          2. when a reader's request results in an artificial partial chunk (in BlockSender), recompute the checksum for the partial chunk in the packet (sketched below)
          - functionally solves the problem
          - introduces inefficiency since the socket-to-socket copy can't be done
          - could possibly improve efficiency by having the client break packet requests for anything that has multiple chunks with a partial chunk at the end into two: the partial chunk goes as a lone request
          - results in one more packet going back & forth, but the previous full chunks can do more efficient socket-to-socket copies
          3. have the datanode 'refresh' the length when it actually starts reading
          - will send more data than the client requests
          - checksum will match data
          - client does truncation of data after the crc check
          - still has a race condition with the checksum for the last chunk being out of sync with the data from a given reader's perspective
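
          A rough sketch of what solution 2 amounts to, using the JDK's CRC32 (editor's illustration; the real change would live in BlockSender, and these names are hypothetical):

            import java.util.zip.CRC32;

            // Editor's sketch of solution 2: when the requested range ends in a partial
            // chunk, recompute that chunk's CRC from the bytes actually being sent
            // instead of trusting the possibly stale on-disk checksum.
            final class PartialChunkChecksum {
              static int recompute(byte[] chunkData, int off, int partialLen) {
                CRC32 crc = new CRC32();
                crc.update(chunkData, off, partialLen); // checksum only the bytes the reader will get
                return (int) crc.getValue();            // HDFS stores CRC32 values as 4 bytes
              }
            }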

          Todd Lipcon added a comment -

          Hey Sam,

          Thanks for posting some thoughts here. I've been thinking about this one over the past couple of weeks and don't have any particularly great solutions either :-/

          1. truncate in progress blocks to chunk boundaries. This solved this problem, but fails as technically sync'd data is not available (partial chunk at the end)

          I agree this isn't very acceptable - a common use case is following an edit log, and we don't want to lose the last 512 bytes of edits from the reader.

          when reader's request results in an artificial partial chunk (in BlockSender), recompute the checksum for the partial chunk in the packet

          So then we're throwing away disk checksumming on the last chunk of a file? I thought of this one too, and I think it's OK, though it's not great. I think so long as we only do this when the file is known to be Under Construction, it's acceptable. We'll have to be careful, though, about races between checking whether we're reading the last checksum chunk and appending more to the file.

          have datanode 'refresh' the length when it actually starts reading

          Not sure I follow - as you noted it still seems to have a bug potential.

          The other idea I had that isn't fully fleshed out is this:

          • The DN already knows which blocks are "in progress"
          • For each "in progress" block, we add a lock updatingLastChecksumChunkLock.
          • When the writer detects that it's appending mid-checksum-chunk to the last chunk, it takes this lock before doing the rewind and rewrite of meta.
          • The writer releases the lock after writing both the checksum and the data
          • When the reader is reading a range that includes the last checksum chunk of an under-construction file, it needs to acquire this lock first, read both the checksum and data, then release the lock (see the sketch below).
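
          A sketch of this per-block locking idea (editor's illustration; updatingLastChecksumChunkLock and the surrounding class are hypothetical, not existing DataNode code):

            import java.util.concurrent.locks.ReentrantLock;

            // Editor's sketch of the locking idea above.
            class InProgressBlockState {
              final ReentrantLock updatingLastChecksumChunkLock = new ReentrantLock();

              // Writer: taken only when rewinding/rewriting the last (partial) checksum chunk.
              void writerRewriteLastChunk(Runnable rewriteMetaThenData) {
                updatingLastChecksumChunkLock.lock();
                try {
                  rewriteMetaThenData.run();   // rewrite checksum and data as one unit w.r.t. readers
                } finally {
                  updatingLastChecksumChunkLock.unlock();
                }
              }

              // Reader: taken only when the requested range covers the last checksum chunk
              // of an under-construction replica.
              void readerReadLastChunk(Runnable readChecksumAndData) {
                updatingLastChecksumChunkLock.lock();
                try {
                  readChecksumAndData.run();   // read checksum and data as one consistent unit
                } finally {
                  updatingLastChecksumChunkLock.unlock();
                }
              }
            }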

          Couple potential issues:

          • We'll have to be careful about races between the under-construction check and the read - perhaps some more coordination with FSDataset is necessary here (we already have to coordinate this a little bit to figure out whether to read out of the rbw directory or the finalized directory, right?)
          • We still have a potential issue about failure recovery - I think we need to add some code to the DN recovery that cleans up the last partial chunk in RBW replicas.

          One other more complicated idea:

          • Next to each RBW file, in addition to having the meta/CRC32 file, we can add a blk_N.rbwMetaData - we can keep some extra state in this file about the last partial checksum chunk, and use it for getting consistent reads.
          Todd Lipcon added a comment -

          BTW, this issue is actually really dangerous - it doesn't just cause checksum exceptions for the reader, but the reader then goes and reports the blocks as corrupt to the NN, and the entire block becomes unreadable!

          Would be interested to hear from the Yahoo folks on this one.

          Hairong Kuang added a comment -

          > Would be interested to hear from the Yahoo folks on this one.

          Thanks to Sam for taking the initiative on this problem; I had a brief off-line discussion with him. I personally like solution 2: recompute the checksum for the partial last chunk. This was also what came to my mind when Todd filed the jira.

          Solution 1 violates semantics. Solution 3 needs to be extended with Todd's suggestion; however, this means readers may significantly slow down the writer if there is a large number of read requests.

          Todd Lipcon added a comment -

          Solution 2 should work, but we still need to add to the recovery code to recompute the last chunk checksum during rbw recovery, right? (ie a DN could crash in between flushing data and metadata)

          sam rash added a comment -

          thanks for the additional info.

          fwiw, I have an impl of #2 that does it the slightly more efficient way: BlockSender.sendChunks() examines 'len' and, if len % bytesPerChecksum != 0, it truncates len to a chunk boundary. It can do this since it returns len. The result is that what was a single packet to the receiver is now two, but the first one can be done with transferToFully() using the existing checksums, and the lone partial chunk has its own packet (and in this case, there's the extra buffer copy in order to recompute the checksum).
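
          The truncation arithmetic described here, as a standalone sketch (editor's illustration; names are not the actual BlockSender.sendChunks() variables):

            // Editor's sketch: send only whole chunks now; the lone partial chunk
            // goes in a follow-up packet with a recomputed checksum.
            final class ChunkBoundary {
              static int truncate(int len, int bytesPerChecksum) {
                int partial = len % bytesPerChecksum;
                if (partial == 0) {
                  return len;           // already chunk-aligned: single packet, transferTo path
                }
                return len - partial;   // full chunks only; the remainder is sent separately
              }
            }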

          For this attempt, I punted on figuring out if the block is in progress or not--I'm ok with the slight inefficiency if it avoids race conditions. I believe we might be able to address this with some synchronization around state changes to a block being an ongoing create or not.

          Can you comment on where the recovery code is that I also need to tweak? (very new to working on hdfs)

          I still need to do some more testing on it and clean it up; also, is there an existing unit test for this case yet? (there certainly isn't in 0.20) I am also going to try to construct one for that.

          Hairong Kuang added a comment -

          > need to add to the recovery code to recompute the last chunk checksum during rbw recovery, right?
          In the trunk, this is not an issue. When the DN starts up, an rbw replica is changed to rwr (ReplicaWaitingToBeRecovered); during this it checks the last chunk and finds out the number of bytes in the block that match its crc. If the crc does not match, the last chunk gets thrown away. This does not violate hflush (sync) semantics because in this case some error occurred and other replicas may still have a good copy.

          But I am not sure about 0.20. I think it does something similar. Anyway, I think we could ignore startup for the issue.

          > is there an existing unit test for this case yet?
          Todd posted some good concurrent reader tests at HDFS-1060. Please check if you could use them.

          sam rash added a comment -

          re: recovery code
          I will check with dhruba about how 20 handles the recovery (not familiar with that part of the code yet)

          re: tests
          great, thanks. I was able to construct a case that deterministically fails (sketched below):

          1. writer opens a file, writes and syncs some # of bytes that includes a partial chunk
          2. reader opens that stream, reads some bytes (to make the datanode open the metadata and block data streams)
          3. writer writes additional bytes that fill out the partial chunk
          4. reader continues reading to the end of the file as of when it opened
          5. reader throws a CRC error
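
          A condensed sketch of this sequence against the 0.20 client API (editor's illustration; it assumes a running MiniDFSCluster or test cluster, and the path and byte counts are arbitrary):

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FSDataInputStream;
            import org.apache.hadoop.fs.FSDataOutputStream;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;

            // Editor's sketch of the five steps above, not the actual test from the patch.
            public class TailerCrcRepro {
              public static void main(String[] args) throws Exception {
                FileSystem fs = FileSystem.get(new Configuration());
                Path p = new Path("/tmp/tailer-crc-repro");

                // 1. writer writes and syncs a partial chunk (300 bytes < 512)
                FSDataOutputStream out = fs.create(p);
                out.write(new byte[300]);
                out.sync();

                // 2. reader opens the file and reads some bytes, which opens the
                //    datanode's block data and metadata streams
                FSDataInputStream in = fs.open(p);
                in.read(new byte[100]);

                // 3. writer fills out the partial chunk
                out.write(new byte[300]);
                out.sync();

                // 4./5. reader continues to the end of the file it saw at open time;
                //       before the fix this is where ChecksumException shows up
                in.readFully(new byte[200]);

                in.close();
                out.close();
              }
            }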

          I will see if I can construct deterministic test cases for these other ones or use directly as well.

          thanks again

          sam rash added a comment -

          update:

          problem refinement/issues:

          First, we can't recompute all partial chunk CRCs without a regression: blocks on disk that are not being written to may have errors that we would miss. This means we have to be more careful and effectively only recompute a checksum that we can guarantee is mismatched or very very likely to be wrong.

          After trying out various ways of detecting if blocks changed and recomputing the CRC for partial chunks, I generalized the problem to this: there is data length which is the size of the data on disk, and metadata length which is the amount of data for which we have meta data (poor name--need a better one). The problem with concurrent reads/write is that these get out of sync.

          Problems that could occur or have actually occurred:

          1. CRC error1: BlockSender is created when data length is > metadata length. Produces a crc error in that the crc in the last chunk is for less data
          2. CRC error2: when reading from disk, more metadata is available than blockLength (data size is in fact greater than when BlockSender was created)
          3. EOF error: similar to 1, but in theory EOF could be encountered when reading checksumIn (haven't seen it yet)

          Solution:
          1. we need to guarantee when we create a BlockSender (or anything that will direct reading of data), blockLength <= metadata length <= data length
          (blockLength being what we consider available to read)
          2. we detect when the block changes after we get the blockLength (create BlockSender) in order to know if we should recompute partial chunk CRCs

          In this way, if we guarantee #1 and implement #2, we can know when the CRC will be invalid and recompute it w/o any regression.

          For #1, we already have ongoing creates concept which keeps some in-memory information about blocks being written to (new or append). We add a 'visible length' attribute which we also expose in FSDatasetInterface.getVisibleLength(Block b). This is an in memory length of the block that we update only after flushing both data and metadata to disk. For blocks not being appended to, this function delegates to the original FSDatasetInterface.getLength(Block b) which is unchanged. In this way, if BlockSender uses this for the blockLength, #1 above holds.

          For #2, we 'memoize' some info about the block when we create it. In particular, we get the actual length of the block on disk and store it. On each packet send, we compare the expected length of our inputstream to what is actually there. If there is more data, the block has changed and the CRC data can't be trusted for the last partial chunk.
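
          A sketch of the 'visible length' bookkeeping described for #1 (editor's illustration; the class name and setter are stand-ins, while getVisibleLength mirrors the FSDatasetInterface.getVisibleLength accessor described above):

            // Editor's sketch: advance the visible length only after both the block
            // data and its checksum have been flushed to disk.
            class OngoingCreateInfo {
              private long visibleLength;   // bytes safe to serve: data AND checksum are on disk

              // Writer side: called only after flush() of both the block file and the
              // meta (checksum) file has returned.
              synchronized void setVisibleLength(long len) {
                visibleLength = len;
              }

              // Reader side: used by BlockSender as the block length, so that
              // blockLength <= metadata length <= data length always holds.
              synchronized long getVisibleLength() {
                return visibleLength;
              }
            }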

          Test cases: inspired by Todd's (I could not apply the patch to 0.20), I create two threads and have one write very small data sizes and call sync. Another thread opens the file, reads to the end, remembers that position, and resumes from there. This reliably produces the errors within a few seconds, and runs 10-15s when there is no error (tunable). In order to detect data corruption (which I saw with some particular bugs), I made the data written such that byte X is X % 127. This helps catch issues that CRC recomputation might hide.

          Hairong Kuang added a comment -

          > 1. CRC error1: BlockSender is created when data length is > metadata length.
          Could you please explain what you mean by data length and metadata length?

          Hairong Kuang added a comment -

          In the trunk I think there is a more elegant way of solving the problem.

          In each ReplicaBeingWritten, we could have two more fields to keep track of the last consistent state: the replica length and the last chunk's crc. The last consistent state is defined as the replica length right after the most recent packet has been flushed to the disk. So when a read request comes in, if the last byte requested falls into a being-written last chunk, the datanode returns the last chunk up to the last consistent state. The crc is read from memory instead of from the disk.

          This will also make the replica recovery problem filed in HDFS-1103 easier to solve.
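
          A sketch of that in-memory last-chunk state (editor's illustration; it anticipates the getLastChecksumAndDataLen/setLastChecksumAndDataLen methods discussed later in this thread, but is not the committed code):

            // Editor's sketch: keep the last consistent data length and the matching
            // partial-chunk CRC together, updated and read atomically.
            class ReplicaBeingWrittenSketch {
              private long bytesOnDisk;      // data length of the last flushed, consistent state
              private byte[] lastChecksum;   // CRC of the partial last chunk at that length

              // Writer (BlockReceiver): set after both data and checksum hit the disk.
              synchronized void setLastChecksumAndDataLen(long dataLength, byte[] checksum) {
                this.bytesOnDisk = dataLength;
                this.lastChecksum = checksum;
              }

              // Reader (BlockSender): read both together so the length and CRC always match.
              synchronized Object[] getLastChecksumAndDataLen() {
                return new Object[] { bytesOnDisk, lastChecksum };
              }
            }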

          sam rash added a comment -

          sorry, as noted, that's an awful name: "metadata length which is the amount of data for which we have meta data"
          i.e., the length of data for which we have valid crc data. The data length is always growing first, so the solution here was to track the metadata length as the 'visible block length'. This doesn't cleanly solve the unstable last-checksum problem, which is why we detect that change and recompute.

          dhruba borthakur added a comment -

          It appears that Sam is currently working on this patch.

          Todd Lipcon added a comment -

          Sam, are you working on a patch for trunk HDFS here or just the 20 branch append feature?

          sam rash added a comment -

          i have a 0.20 patch we've tested and i plan to port it forward to trunk and adopt some changes similar to what Hairong laid out (hopefully in the next couple of weeks). is there a more urgent need for it in trunk? do you want me to post the 0.20-fb version?

          Todd Lipcon added a comment -

          Hey Sam. No urgent need, except that we'd like to close some of the HDFS blockers for the upcoming release that Tom is working on. If you're on it, great! Let me know if I can be of help.

          Regarding the 20 version, I'd be happy to take a look, thanks.

          sam rash added a comment -

          0.20 test + patch

          Jean-Daniel Cryans added a comment -

          Sam, can you base the patch on hadoop's root folder? It's kinda hard to apply as is.

          sam rash added a comment -

          oops, thought i had passed the right options to git diff. will update in a bit

          in the meantime,

          patch -p3 < patch.txt

          will work

          sam rash added a comment -

          based on hadoop root dir

          sam rash added a comment -

          note: this patch is incomplete--there's effectively a 1-line change needed if you happen to deal with very small writes + sync (if you sync lots of writes < chunk size).

          i have a 2nd patch that includes test + fix for this for 0.20--need to merge into a unified patch and will post

          sam rash added a comment -

          -includes all patches around concurrent reader crc problems for the 0.20 port
          -fixed so patch should include full unit tests (previously depended on other commit not included)

          sam rash added a comment -

          todd pointed out i was missing an essential hunk in FSNamesystem--added it back in

          Todd Lipcon added a comment -

          Some comments:

          • IOUtils.readFileChannelFully:
            • IOUtils.readFully spelled "premature" wrong, but we might as well not duplicate the typo
            • no need to increment the 'off' variable since the bytebuffer internally deals with that
            • can you use ByteBuffer.hasRemaining() in the loop instead of updating the toRead variable?
          • ChecksumUtil:
            • Missing apache license header
            • Since it only contains static methods, either make it an abstract class or give it a private constructor
            • Perhaps add an assert like: assert checksumOff + numChunks * checksumSize < dataOff; (to make sure there's enough space in the checksum buffer)
          • Generally, have you verified through strace or other means that transferTo is still being used properly for normal transfers? The logic in BlockSender is getting very complicated; it's hard to verify by looking at it.
          sam rash added a comment -

          hey todd,

          regarding the 0.20 patch, i was planning on leaving it as-is and working on the patch for trunk. that one will be up to snuff and have all the headers and hopefully be cleaner. otherwise, I'll have 3 concurrent branches to keep going (internal 0.20, 0.20 'refined', and trunk)

          does that work for you? or do you need a cleaned up 0.20 version for something in particular?

          Todd Lipcon added a comment -

          That sounds fine - I was making those comments so you could incorporate them into the trunk patch when you get there.

          I am still curious about whether you verified that transferTo still works (either via benchmark or instrumentation/strace)

          sam rash added a comment -

          ah, sorry, meant to respond to that: I ran the unit tests I have that exercise that path in a debugger and set breakpoints to verify the code path is exercised.

          i agree it's a complicated bit of code - I cleaned it up a tiny bit. I think in reality it means breaking the sending of packets (the sendChunks method) out into 2 subclasses/methods: one for transferTo and one for the regular path. Probably separate classes (non-static inner classes so as to use parent class state, maybe - haven't thought the details through yet).

          Todd Lipcon added a comment -

          Yep, some refactor is certainly due there, but out of scope here. Thanks!

          sam rash added a comment -

          @hairong:

          I'm looking a little at implementing this in trunk (reading your append/hflush doc from hdfs-265), and I have a question. From above:

          "In each ReplcaBeingWritten, we could have two more fields to keep track of the last consistent state: replica length and the last chunk's crc"

          why does there need to be another length field? isn't getVisibleLength() == acked bytes sufficient? if the crc stored in the RBW is for that length, you only need the additional byte[] field, which is the last chunk's crc, I think.

          ReplicaBeingWritten.setBytesAcked() could take the crc and atomically set the len + bytes

          sam rash added a comment -

          hmm, in looking at the code more, I see that this depends on what # of bytes we want to make available to readers.

          -visible length (bytes acked) : needed for consistent view of data I think
          -bytes on disk : this seems like we'll get inconsistent reads; and in theory, acked data may be more than on disk for a given node (if I read the doc + code right). However, how can a DN send data that's not on disk unless it's made available via memory? (very complex)

          Hairong Kuang added a comment -

          Sam, the replica length in my comment meant the visible length so no new length field is needed. The read algorithm in trunk simplifies the read algorithm in the design doc.

          A block reader waits until its disk length >= the requested # of bytes. To solve the last chunk CRC problem, I am thinking of storing the last chunk's crc in ReplicaBeingWritten after the chunk is received. A block reader returns the visible length of bytes and the last chunk crc from memory. This should work because # of requested bytes <= disk length <= visible length.

          Hairong Kuang added a comment -

          When I thought more about it, the replica length in my previous comment should be the replica's disk length. So after a chunk is flushed to disk, a block receiver updates its disk length (as in trunk) and should also update its last chunk crc. When a read request comes in, a block reader waits until its requested bytes are on disk and then returns all the bytes up to the replica's disk length.

          sam rash added a comment -

          ported patch to trunk (hairong's idea of storing last checksum)

          Nicolas Spiegelberg added a comment -

          This should be pulled into the branch-0.20-append branch.

          Hairong Kuang added a comment -

          Thanks, Sam, for working on the patch for trunk. Here are my comments:

          1. BlockSender.java:
            • the condition replica.getBytesOnDisk() < replicaVisibleLength should be getBytesOnDisk() < startOffset + length. This guarantees that the bytes to be read have already been flushed to the disk.
            • When the while loop exits and the bytes still have not been flushed to disk, BlockSender should throw an IOException.
            • It seems to me that we should remove the use of replicaVisibleLength from BlockSender.
            • the way to calculate endOffset should be (see the sketch after this list):
              if (startOffset + length falls into the same chunk where chunkChecksum.getDataLength() is located) {
                  endOffset = chunkChecksum.getDataLength(); --- case 1
              } else {
                  endOffset = the chunk boundary where (startOffset + length) is located
              }

            • In case 1, the last chunk's checksum does not need to be read from disk.
          2. ReplicaInPipeline, ReplicaInPipelineInterface, and ReplicaBeingWritten
            • I do not think we need to make any change to ReplicaInPipeline and ReplicaInPipelineInterface
            • We just need to add the attribute lastChecksum and two synchronized methods to ReplicaBeingWritten. Would it be more readable if we name the methods getLastChecksumAndDataLen and setLastChecksumAndDataLen?
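
          Editor's note: a compilable rendering of the endOffset rule above, with illustrative variable names (chunkChecksumDataLength stands in for chunkChecksum.getDataLength()):

            // Editor's sketch of the endOffset calculation described in the review comment.
            final class EndOffsetSketch {
              static long endOffset(long startOffset, long length,
                                    long chunkChecksumDataLength, int bytesPerChecksum) {
                long requestEnd = startOffset + length;
                if (sameChunk(requestEnd - 1, chunkChecksumDataLength - 1, bytesPerChecksum)) {
                  // case 1: the request ends in the chunk still being written; serve only
                  // the bytes covered by the in-memory checksum and use that checksum
                  return chunkChecksumDataLength;
                }
                // otherwise round the request end up to its chunk boundary; the on-disk
                // checksums for those full chunks are stable
                return ((requestEnd + bytesPerChecksum - 1) / bytesPerChecksum) * bytesPerChecksum;
              }

              private static boolean sameChunk(long a, long b, int bytesPerChecksum) {
                return a / bytesPerChecksum == b / bytesPerChecksum;
              }
            }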
          sam rash added a comment -

          Thanks for the quick review.
          I understand most of the comments, but have a couple of questions:

          1. replicaVisibleLength was here before I made any changes. Why is it not valid? I understood it to be an upper bound on the bytes that could be read from this block. Is it the case that start + length <= replicaVisibleLength and we want to optimize?

          (the for loop to wait for bytes on disk >= visible length was here before, I just moved it earlier in the constructor)

          2. not sure I understand endOffset. This was again a variable that already existed. What I thought you were getting at was the condition to decide if we should use the in-memory checksum or not (which is what you describe).

          3. If we don't put the sync set/get method in ReplicaInPipelineInterface, we will have to use an if/else construct on instanceof in BlockReceiver and call one or the other. I can see the argument for keeping the method out of the interface since it is RBW-specific, but on the other hand, it's effectively a no-op for other implementers of the interface and leads to cleaner code (better natural polymorphism than if-else constructs to force it).

          either way, just wanted to throw that out there as a question of style

          sam rash added a comment -

          1. new endOffset calc includes determining if in-memory checksum is needed
          2. added methods to RBW only to set/get last checksum and data length
          -track this dataLength separately, as setBytesOnDisk may be called independently and make the length/byte[] not match (in theory bytes on disk could be set to more and we still want a checksum + the corresponding length kept)
          3. appropriate changes around waiting for start + length

          did not remove all replicaVisibleLength uses yet--want to clarify what to replace them with in pre-existing code.

          Hairong Kuang added a comment -

          The new patch is in good shape! More comments are listed below:
          1. please remove all unnecessary indentation changes;
          2. BlockSender: endOffset = chunkChecksum.getDataLength(); --- this may cause an NPE if the replica to be read is not an RBW;
          3. BlockSender: if tmpLen >= endOffset, it might be because the replica is a finalized replica and the reader is trying to read the end of the file, which is not aligned at a chunk boundary.

          sam rash added a comment -

          I am addressing the last comments. I have one more question, though, as I have one test that still fails and I want to see what you think the expected behavior should be:

          immediate read of a new file:
          1. writer creates a file and starts to write and hence blocks are assigned in the NN
          2. a reader gets these locations and contacts DN
          3. DN has not yet put the replica in the volumeMap and FSDataset.getVisibleLength() throws a MissingReplicaException

          In 0.20, I made it so that client just treats this as a 0-length file. What should the behavior in trunk be?

          Hairong Kuang added a comment -

          The append design document discusses this scenario; it says it is OK to throw an exception because the chance of hitting it is low. But the semantics you chose in 0.20 seem fine to me too.

          sam rash added a comment -

          hmm, i can remove the test case. one of our internal tools saw this rather frequently in 0.20--maybe in trunk it's far less likely?

          sam rash added a comment -

          1. endOffset is either bytesOnDisk or the chunkChecksum.getDataLength()
          2. if tmpLen >= endOffset && this is a write in progress, use the in-memory checksum (else this is a finalized block not ending on a chunk boundary)
          3. fixed up whitespace

          sam rash added a comment -

          also disabled the test on immediate read of a new file for now. if we want to change how this is handled, I can enable it

          Hairong Kuang added a comment -

          Sam, the patch is in good shape. Thanks for working on this. A few minor comments:
          1. ReplicaBeingWritten.java: dataLength and bytesOnDisk are the same, right? We do not need to introduce another field dataLength. I am also hesitant to declare dataLength & lastChecksum as volatile. Accesses to them are already synchronized, and the normal case is writing without reading.
          2. We probably should remove setBytesOnDisk from ReplicaInPipelineInterface & ReplicaInPipeline.

          > In 0.20, I made it so that client just treats this as a 0-length file. one of our internal tools saw this rather frequently in 0.20.
          Good to know this. Then could you please handle this case in the trunk the same as well? Thanks again, Sam.

          sam rash added a comment -

          1. they aren't guaranteed to be, since there are methods to change bytesOnDisk separately from the lastChecksum bytes. It's entirely conceivable that something could update the bytes on disk w/o updating the lastChecksum with the current set of methods.

          If we are ok with a loosely coupled guarantee, then we can use bytesOnDisk and be careful never to call setBytesOnDisk() for any RBW

          2. oh--your previous comments indicated we shouldn't change either ReplicaInPipelineInterface or ReplicaInPipeline. If that's not the case and we can do this, then my comment above doesn't hold. we use bytesOnDisk and guarantee it's in sync with the checksum in a single synchronized method (I like this)

          3. will make the update to treat missing last blocks as 0-length and re-instate the unit test.

          thanks for all the help on this

          Hairong Kuang added a comment -

          > they aren't guaranteed to be since there are methods to change the bytesOnDisk separate from the lastCheckSum bytes.
          I do not see any place that updates bytesOnDisk except for BlockReceiver. That's why I suggested to remove setBytesOnDisk from ReplicaInPipeline etc.

          Todd Lipcon added a comment -

          [for branch 0.20 append, +1 -- I've been running with this for 6 weeks, it works, and looks good!]

          sam rash added a comment -

          So removing setBytesOnDisk() means:

                    if (replicaInfo instanceof ReplicaBeingWritten) {
                      ((ReplicaBeingWritten) replicaInfo)
                        .setLastChecksumAndDataLen(offsetInBlock, lastChunkChecksum);
                    }
                    
                    replicaInfo.setBytesOnDisk(offsetInBlock);
          

          will not have the latter. So all other implementations of Replica will have a valid value for getBytesOnDisk()?
          Does this also mean that the impl of getBytesOnDisk for ReplicaInPipeline will move to ReplicaBeingWritten?

          sam rash added a comment -

          another way to ask this: only ReplicaBeingWritten needs to have the bytes on disk set in BlockReceiver?

          Hairong Kuang added a comment -

          I meant removing setBytesOnDisk and moving setLastChecksumAndDataLen to ReplicaInPipelineInterface.

          sam rash added a comment -

          got it. i will make the changes and get a patch soon

          dhruba borthakur added a comment -

          we need this for the 0.20-append branch too.

          dhruba borthakur added a comment -

          hi sam, will it be possible for you to address hairong's feedback and provide a new patch? Thanks.

          sam rash added a comment -

          i have an updated patch, but it does not yet handle missing replicas as 0-sized for under-construction files. there may be other 0.20 patches to port to make this happen.

          sam rash added a comment -

          includes the changes requested by Hairong. also handles immediate reading of new files by translating a ReplicaNotFoundException into a 0-length block within DFSInputStream for under-construction files

          Nicolas Spiegelberg added a comment -

          fix of HDFS-1057 for 0.20-append branch (courtesy of Todd)

          sam rash added a comment -

          patch should use --no-prefix to get rid of 'a' and 'b' in paths, fyi

          Hairong Kuang added a comment -

          > translating a ReplicaNotFoundException into a 0-length block within DFSInputStream for under construction files
          Should the code check all replicas before returning 0?

          Ideally the NN should not return block locations to readers if the block is still in the pipeline-construction stage.

          sam rash added a comment -

          sorry, i don't understand. this is a race condition where the namenode has assigned locations to the block, but the client hasn't sent data yet. the NN cannot know that the DNs don't have data on disk yet unless we add additional NN coordination. our choice in this condition is to return 0 or let the exception propagate. I had done the latter, but you asked for the former, unless I misunderstood.

          can you clarify what you want?

          Hairong Kuang added a comment -

          Yeah, the NN does not know the block construction stage.
          > can you clarify what you want?
          The patch returns 0 if one datanode throws ReplicaNotFoundException. Should it check all replicas before returning 0?

          sam rash added a comment -

          per our offline discussion, it seems the NN doesn't know when the pipeline is created, but the writer does, so the NN has to return the replicas for the current block in this case.

          I will change it so we check all DNs for a replica before using the default of 0. I need to think about whether we require all DNs to throw ReplicaNotFoundException, or whether some may throw a different exception.

          sam rash added a comment -

          -returns 0 length only if all DNs are missing the replica (any other IO exception will cause the client to get an exception, and it can retry); sketched below
          -my diff viewer does not show any whitespace or indentation changes, but please advise if you see any
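
          A rough sketch of that client-side rule; the types and helper below are stand-ins for illustration, not the real DFSInputStream code:

          // Ask each datanode holding the block for its visible length. A missing replica on
          // one DN just means "try the next one"; only if every DN reports the replica as
          // missing do we treat the block as 0-length (the writer has a pipeline but no data yet).
          // Any other IOException propagates so the caller can retry.
          class ReplicaNotFoundException extends java.io.IOException {}   // stand-in for the HDFS exception

          interface VisibleLengthFetcher {
            long fetchVisibleLength(String datanode) throws java.io.IOException;
          }

          class BlockLengthReader {
            long readBlockLength(java.util.List<String> datanodes, VisibleLengthFetcher fetcher)
                throws java.io.IOException {
              for (String datanode : datanodes) {
                try {
                  return fetcher.fetchVisibleLength(datanode);
                } catch (ReplicaNotFoundException e) {
                  // this DN has not created the replica on disk yet; fall through to the next DN
                }
              }
              return 0;   // all DNs are missing the replica: report an empty block
            }
          }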

          Hairong Kuang added a comment -

          +1. The trunk patch looks good. Great work, Sam!

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12448081/hdfs-1057-trunk-5.txt
          against trunk revision 957669.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 8 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The applied patch generated 30 javac compiler warnings (more than the trunk's current 23 warnings).

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/207/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/207/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/207/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/207/console

          This message is automatically generated.

          sam rash added a comment -

          the one test that failed from my new tests had an fd leak. i've corrected that. the other failed tests I cannot reproduce:

          1. org.apache.hadoop.hdfs.TestFileConcurrentReader.testUnfinishedBlockCRCErrorNormalTransferVerySmallWrite
          -had fd leak, fixed

          2. org.apache.hadoop.hdfs.security.token.block.TestBlockToken.testBlockTokenRpc

          [junit] Running org.apache.hadoop.hdfs.security.token.block.TestBlockToken
          [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 1.305 sec

          3. org.apache.hadoop.hdfs.server.common.TestJspHelper.testGetUgi

          [junit] Running org.apache.hadoop.hdfs.server.common.TestJspHelper
          [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.309 sec

          I can submit the patch with the fix for #1 plus warning fixes

          sam rash added a comment -

          -fixed warnings
          -fixed fd leak in some of the added tests

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12448323/hdfs-1057-trunk-6.txt
          against trunk revision 957669.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 8 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/416/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/416/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/416/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/416/console

          This message is automatically generated.

          sam rash added a comment -

          from the raw console output of hudson:

          [exec] [junit] Tests run: 3, Failures: 0, Errors: 1, Time elapsed: 0.624 sec
          [exec] [junit] Test org.apache.hadoop.hdfs.security.token.block.TestBlockToken FAILED

          [exec] [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0.706 sec
          [exec] [junit] Test org.apache.hadoop.hdfs.server.common.TestJspHelper FAILED

          [exec] [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 28.477 sec
          [exec] [junit] Test org.apache.hadoop.hdfsproxy.TestHdfsProxy FAILED

          I ran the tests locally and the first 2 succeed. The third fails on the latest trunk without hdfs-1057. I think from the test perspective, this is safe to commit.

          1. TestBlockToken

          run-test-hdfs:
          [delete] Deleting directory /data/users/srash/apache/hadoop-hdfs/build/test/data
          [mkdir] Created dir: /data/users/srash/apache/hadoop-hdfs/build/test/data
          [delete] Deleting directory /data/users/srash/apache/hadoop-hdfs/build/test/logs
          [mkdir] Created dir: /data/users/srash/apache/hadoop-hdfs/build/test/logs
          [junit] WARNING: multiple versions of ant detected in path for junit
          [junit] jar:file:/usr/local/ant/lib/ant.jar!/org/apache/tools/ant/Project.class
          [junit] and jar:file:/home/srash/.ivy2/cache/ant/ant/jars/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
          [junit] Running org.apache.hadoop.hdfs.security.token.block.TestBlockToken
          [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 1.248 sec

          2. TestJspHelper
          run-test-hdfs:
          [delete] Deleting directory /data/users/srash/apache/hadoop-hdfs/build/test/data
          [mkdir] Created dir: /data/users/srash/apache/hadoop-hdfs/build/test/data
          [delete] Deleting directory /data/users/srash/apache/hadoop-hdfs/build/test/logs
          [mkdir] Created dir: /data/users/srash/apache/hadoop-hdfs/build/test/logs
          [junit] WARNING: multiple versions of ant detected in path for junit
          [junit] jar:file:/usr/local/ant/lib/ant.jar!/org/apache/tools/ant/Project.class
          [junit] and jar:file:/home/srash/.ivy2/cache/ant/ant/jars/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
          [junit] Running org.apache.hadoop.hdfs.server.common.TestJspHelper
          [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.275 sec

          Hairong Kuang added a comment -

          I've committed the trunk change to 0.21 and trunk. Thanks Sam!

          dhruba borthakur added a comment -

          I have committed it to 0.20-append branch as well. Thanks Sam!

          Tsz Wo Nicholas Sze added a comment -

          Sam/Hairong/Dhruba, TestFileConcurrentReader keeps failing. Could you take a look? Thank you in advance. See HDFS-1310.

          Tsz Wo Nicholas Sze added a comment -

          If there is no follow-up on the test failure, I suggest that we should first revert the committed patch and then fix the problem here.

          dhruba borthakur added a comment -

          I agree with Nicholas. In the worst case, shall we get rid of TestFileConcurrentReader itself?

          sam rash added a comment -

          I'm confused--the jira for the test result indicated you had solved the problem. Can you let me know what you need me to do?

          Tanping Wang added a comment -

          Hi Sam, I have just commented on HDFS-1310. Would you please look at the Hudson results for reference:

          https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk/413/testReport/org.apache.hadoop.hdfs/TestFileConcurrentReader/testUnfinishedBlockCRCErrorTransferToVerySmallWrite/
          https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk/413/testReport/org.apache.hadoop.hdfs/TestFileConcurrentReader/testUnfinishedBlockCRCErrorNormalTransfer/
          https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk/413/testReport/org.apache.hadoop.hdfs/TestFileConcurrentReader/testUnfinishedBlockCRCErrorNormalTransferVerySmallWrite/

          Thanks!

          Tsz Wo Nicholas Sze added a comment -

          Hi Sam, TestFileConcurrentReader fails again, created HDFS-1679. It would be great if you could take a look.

          sam rash added a comment -

          does this test include the patch:

          https://issues.apache.org/jira/browse/HADOOP-6907

          ?

          Tsz Wo Nicholas Sze added a comment -

          Build #221 should include HADOOP-6907, which has been committed for a few months.

          Tsz Wo Nicholas Sze added a comment -

          TestFileConcurrentReader has a long history of failing. There are multiple JIRAs: HDFS-1401, HDFS-1310, HDFS-1679, HDFS-1885, but the problem is still not fixed. I suggest simply removing it since no one is maintaining it.

          sam rash added a comment -

          the last time I debugged the test failure, it exposed a socket/fd leak in a completely unrelated part of the code. The test failing here also has nothing to do with the added feature--because it closes/opens files in rapid succession, it is prone to expose resource leaks.

          Removing this test (or feature) won't take away the underlying problem that should be looked at.

          Tsz Wo Nicholas Sze added a comment -

          Hi Sam, I agree that there are bugs in Hadoop or any other software. However, I don't see any value in keeping a test that fails every build while not fixing the so-called "underlying problem".

          BTW, it is questionable whether the test is valid. Are we supporting that many concurrent readers? How was the number chosen? A machine has finite resources. It is definitely easy to create extreme tests and make them fail.

          sam rash added a comment -

          the test opens/closes files for read/write. that exposed a slow leak last time.

          I suggest anyone concerned with resource leaks in Hadoop investigate. We don't see the test failure in our open-sourced 0.20 fork.

          removing the test is an option; or coming up with a better one (this was my first hdfs feature + test).

          Tsz Wo Nicholas Sze added a comment -

          Hey Sam, do you know exactly what the "underlying problem" is, or is it just a hypothesis?

          Could you work on a better one? It seems that no one is working on it and the test has been failing for almost a year. So I suggest removing it. It would be great if you can improve it.

          sam rash added a comment -

          i assume a similar problem as before. The problem was that code that opened RPC proxies to DNs did not close them in a finally block. The test failure output indicates a socket/fd leak ("Too many open files").

          https://issues.apache.org/jira/browse/HDFS-1310

          the test was succeeding 8 months ago, 2010-09-10, so I'd look at commits that came after that.
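
          The pattern being referred to, in generic form (the proxy type below is a stand-in for illustration, not the actual HDFS RPC proxy class):

          // Release the per-call proxy/socket in a finally block; without this, a tailing
          // reader that re-opens the file hundreds of times leaks one fd per open.
          interface DatanodeProxy extends java.io.Closeable {
            long getVisibleLength() throws java.io.IOException;
          }

          class VisibleLengthHelper {
            long getVisibleLength(DatanodeProxy proxy) throws java.io.IOException {
              try {
                return proxy.getVisibleLength();
              } finally {
                proxy.close();   // always runs, even if the call above throws
              }
            }
          }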

          Todd Lipcon added a comment -

          I believe this test mostly fails on the build infrastructure and not on our local boxes. So that limits the number of people who can investigate. I will try to take a look by running tests there.

          sam rash added a comment -

          Todd: on a different issue, one test in here that looks suspicious is the testImmediateReadOfNewFile.

          It repeatedly opens and closes a file right away, requiring 1k successful opens (at least in our copy)

          Tsz Wo Nicholas Sze added a comment -

          > I believe this test mostly fails on the build infrastructure ...

          It seems that the machines on the build infrastructure are slow/old/heavily loaded. Tests may fail there more easily than locally, so choosing the test parameters, e.g. how many concurrent writers, becomes non-trivial.

          sam rash added a comment -

          if it helps, there is only ever 1 writer + 1 reader in the test. the reader 'tails' by opening and closing the file repeatedly, up to 1000 times (hence exposing socket leaks in the past)
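
          The rough shape of that tailer (the loop bound and buffer size are illustrative; FileSystem and FSDataInputStream are the standard Hadoop client APIs):

          // One reader repeatedly opens the under-construction file, reads whatever is
          // currently visible, and closes it. Closing in finally on every iteration is what
          // keeps the test itself from leaking fds.
          class TailingReader {
            void tail(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path path)
                throws java.io.IOException {
              byte[] buf = new byte[4096];
              for (int i = 0; i < 1000; i++) {
                org.apache.hadoop.fs.FSDataInputStream in = fs.open(path);
                try {
                  while (in.read(buf) > 0) {
                    // consume the currently visible bytes
                  }
                } finally {
                  in.close();
                }
              }
            }
          }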

          Tsz Wo Nicholas Sze added a comment -

          Sam, could you either investigate the underlying problem or improve the test so that it won't fail on hudson?

          Todd Lipcon added a comment -

          Sam seems to be correct that there's some kind of leak going on. lsof on the java process shows several hundred unix sockets open.

          Todd Lipcon added a comment -

          I patched HADOOP-7146 into a local build and it fixed the issue on the "h2" build box. Will elevate priority of that issue.

          sam rash added a comment -

          todd: thanks for digging into this

          Todd Lipcon added a comment -

          Actually, it looks like the leak in 7146 "pushed this over the edge". But, even with that patch, if I "lsof" the java process as it runs, I see it hit 800 or so localhost TCP connections in ESTABLISHED state while running this test case. So, needs more investigation yet.

          Todd Lipcon added a comment -

          aha! I think I understand what's going on here!

          The test has a thread which continually re-opens the file which is being written to. Since the file's in the middle of being written, it makes an RPC to the DataNode in order to determine the visible length of the file. This RPC is authenticated using the block token which came back in the LocatedBlocks object as the security ticket.

          When this RPC hits the IPC layer, it looks at its existing connections and sees none that can be re-used, since the block token differs between the two requesters. Hence, it reconnects, and we end up with hundreds or thousands of IPC connections to the datanode.

          This also explains why Sam doesn't see it on his 0.20 append branch – there are no block tokens there, so the RPC connection is getting reused properly.

          I'll file another JIRA about this issue.
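
          A toy model of the reuse problem described above (a simplified illustration, not Hadoop's actual IPC client): because the cache key includes the caller's token, every re-open of the under-construction block with a fresh block token maps to a new key, and therefore a new connection.

          import java.util.HashMap;
          import java.util.Map;
          import java.util.Objects;

          class IpcConnectionCacheModel {
            static final class ConnectionId {
              final String serverAddress;
              final String token;   // e.g. the block token presented by the caller
              ConnectionId(String serverAddress, String token) {
                this.serverAddress = serverAddress;
                this.token = token;
              }
              @Override public boolean equals(Object o) {
                return o instanceof ConnectionId
                    && serverAddress.equals(((ConnectionId) o).serverAddress)
                    && Objects.equals(token, ((ConnectionId) o).token);
              }
              @Override public int hashCode() { return Objects.hash(serverAddress, token); }
            }

            private final Map<ConnectionId, Object> connections = new HashMap<>();

            // If 'token' differs on every open, no entry is ever reused, so the map (and the
            // sockets it stands in for) grows with each call -- the pile-up seen in lsof.
            Object getConnection(String serverAddress, String token) {
              return connections.computeIfAbsent(new ConnectionId(serverAddress, token),
                                                 id -> new Object());
            }
          }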

          Tsz Wo Nicholas Sze added a comment -

          Todd, well done! Thanks for investigating it.

          Jitendra Nath Pandey added a comment -

          Patch for 20-security branch uploaded.

          Suresh Srinivas added a comment -

          +1 for the patch. I committed it to 0.20-security.


            People

            • Assignee:
              sam rash
            • Reporter:
              Todd Lipcon
            • Votes:
              1
            • Watchers:
              12