Hadoop HDFS
HDFS-941

Datanode xceiver protocol should allow reuse of a connection

    Details

    • Hadoop Flags:
      Reviewed

      Description

      Right now each connection into the datanode xceiver only processes one operation.

      In the case that an operation leaves the stream in a well-defined state (eg a client reads to the end of a block successfully) the same connection could be reused for a second operation. This should improve random read performance significantly.
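      The gist, as a minimal hedged sketch (the class and method names below are illustrative, not the patch's actual API): the client keeps the socket from a fully-drained read and tries it for the next operation against the same datanode, falling back to a fresh connection when the cached one is gone or bad.

        import java.io.IOException;
        import java.net.InetSocketAddress;
        import java.net.Socket;

        // Hypothetical sketch of the reuse idea; the real patch caches sockets
        // inside the DFS client rather than in a standalone class like this.
        class CachedConnection {
          private Socket cached;                 // last socket, left at an op boundary
          private InetSocketAddress cachedAddr;  // datanode it is connected to

          Socket socketTo(InetSocketAddress dn, int connectTimeoutMs) throws IOException {
            if (cached != null && dn.equals(cachedAddr)
                && cached.isConnected() && !cached.isClosed()) {
              Socket s = cached;
              cached = null;                     // hand ownership to the caller
              return s;                          // skip TCP setup entirely
            }
            Socket s = new Socket();
            s.connect(dn, connectTimeoutMs);
            return s;
          }

          // Return the socket only if the op read to end-of-stream, i.e. the
          // stream sits at a well-defined operation boundary.
          void release(Socket s, InetSocketAddress dn, boolean streamClean) throws IOException {
            if (streamClean && cached == null) {
              cached = s;
              cachedAddr = dn;
            } else {
              s.close();
            }
          }
        }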

      1. HDFS-941-1.patch
        42 kB
        bc Wong
      2. HDFS-941-2.patch
        62 kB
        bc Wong
      3. HDFS-941-3.patch
        63 kB
        bc Wong
      4. HDFS-941-3.patch
        63 kB
        bc Wong
      5. HDFS-941-4.patch
        64 kB
        bc Wong
      6. HDFS-941-5.patch
        82 kB
        bc Wong
      7. hdfs941-1.png
        7 kB
        Kihwal Lee
      8. HDFS-941-6.patch
        55 kB
        bc Wong
      9. HDFS-941-6.patch
        55 kB
        bc Wong
      10. HDFS-941-6.patch
        55 kB
        bc Wong
      11. hdfs-941.txt
        56 kB
        Todd Lipcon
      12. hdfs-941.txt
        58 kB
        Todd Lipcon
      13. HDFS-941-6.22.patch
        55 kB
        stack
      14. hdfs-941.txt
        60 kB
        Todd Lipcon
      15. fix-close-delta.txt
        4 kB
        Todd Lipcon
      16. hdfs-941.txt
        60 kB
        Todd Lipcon
      17. 941.22.txt
        63 kB
        stack
      18. 941.22.txt
        63 kB
        stack
      19. 941.22.v2.txt
        63 kB
        stack
      20. 941.22.v3.txt
        61 kB
        stack

          Activity

          Brian Bockelman added a comment -

          From a site which uses mostly random reads, I vote a +10 to this idea.

          stack added a comment -

          I see Brian's +10 and up it +1.

          bc Wong added a comment -

          Attaching a patch, which:

          • In DFSInputStream, stores the last socket and the owner (BlockReader) of that socket.
          • If the owner did read till EOS, then it'll consider reusing that socket.
          • It's a cache of size 1, but easy to extend to n. However, a size of 1 makes sense for now because:
          • The DN (DataXceiver) no longer closes the client connection after every op. It optimistically expects the next op, with a very low timeout value. It closes the connection only on EOS or timeout. (A sketch of this loop follows below.)

          Note that the HDFS-941-1.patch depends on the HDFS-1001 patch.
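
           For illustration, the datanode side of this might be shaped roughly as follows (a hedged sketch: the class, field names, and timeout values are placeholders, not the patch's code). After the first op, the loop waits briefly for another op header instead of closing the connection.

             import java.io.DataInputStream;
             import java.io.EOFException;
             import java.io.IOException;
             import java.net.Socket;
             import java.net.SocketTimeoutException;

             // Hypothetical shape of the DataXceiver change.
             class KeepaliveXceiver {
               private final Socket sock;
               private final DataInputStream in;
               private final int normalTimeoutMs = 60000;
               private final int keepaliveTimeoutMs = 1000; // short optimistic wait

               KeepaliveXceiver(Socket sock, DataInputStream in) {
                 this.sock = sock;
                 this.in = in;
               }

               void run() throws IOException {
                 int opsProcessed = 0;
                 try {
                   do {
                     if (opsProcessed > 0) {
                       // After the first op, only wait briefly for a follow-up request.
                       sock.setSoTimeout(keepaliveTimeoutMs);
                     }
                     byte op;
                     try {
                       op = in.readByte();      // blocks until the next op, or times out
                     } catch (SocketTimeoutException e) {
                       break;                   // client went quiet; close normally
                     } catch (EOFException e) {
                       break;                   // client closed the connection
                     }
                     sock.setSoTimeout(normalTimeoutMs); // restore the normal op timeout
                     processOp(op);             // dispatch read/write/etc.
                     opsProcessed++;
                   } while (!sock.isClosed());
                 } finally {
                   sock.close();
                 }
               }

               private void processOp(byte op) throws IOException {
                 // placeholder for the per-operation dispatch
               }
             }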

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12438934/HDFS-941-1.patch
          against trunk revision 923467.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/130/console

          This message is automatically generated.

          Todd Lipcon added a comment -

          Style notes:

          • in BlockReader:
            +      LOG.warn("Could not write to datanode " + sock.getInetAddress() +
            +               ": " + e.getMessage());
            

            should be more specific - like "Could not write read result status code" and also indicate in the warning somehow that this is not a critical problem. Perhaps "info" level is better? (in my experience if people see WARN they think something is seriously wrong)

          • please move the inner SocketCacheEntry class down lower in DFSInputStream
          • in SocketCacheEntry.setOwner, can you use IOUtils.closeStream to close reader? Similarly in SocketCacheEntry.close
          • We expect the following may happen reasonably often, right?
            +        // Our socket is no good.
            +        DFSClient.LOG.warn("Error making BlockReader. Closing stale " + entry.sock.toString());
            

            I think this should probably be debug level.

          • The edits to the docs in DataNode.java are good - if possible they should probably move into HDFS-1001 though, no?
          • the do { ... } while () loop is a bit hard to follow in DataXceiver. Would it be possible to rearrange the code a bit to be more linear? (eg setting DN_KEEPALIVE_TIMEOUT right before the read at the beginning of the loop if workDone > 0 would be easier to follow in my opinion)

          • In DataXceiver:
            +      } catch (IOException ioe) {
            +        LOG.error("Error reading client status response. Will close connection. Err: " + ioe);
            

            Doesn't this yield error messages on every incomplete client read? Since the response is optional, this seems more like a DEBUG.

          Bigger stuff:

          • I think there is a concurrency issue here. Namely, the positional read API calls through into fetchBlockByteRange, which will use the existing cached socket, regardless of other concurrent operations. So we may end up with multiple block readers on the same socket and everything will fall apart.

          Can you add a test case which tests concurrent use of a DFSInputStream? Maybe a few threads doing random positional reads while another thread does seeks and sequential reads?

          • Regarding the cache size of one - I don't think this is quite true. For a use case like HBase, the region server is continually slamming the local datanode with random read requests from several client threads. Is the idea that such an application should be using multiple DFSInputStreams to read the same file and handle the multithreading itself?
          • In DataXceiver, SocketException is caught and ignored while sending a block ("// Its ok for remote side to close the connection anytime."). I think there are other SocketException types (eg timeout) that could throw here aside from a connection close, so in that case we need to IOUtils.closeStream(out), I believe. A test case for this could be to open a BlockReader, read some bytes, then stop reading so that the other side's BlockSender generates a timeout.
          • Not sure about this removal in the finally clause of opWriteBlock:
            -      IOUtils.closeStream(replyOut);
            

            (a) We still need to close in the case of a downstream-generated exception. Otherwise we'll read the next data bytes from the writer as an operation and have undefined results.
            (b) To keep this patch less dangerous, maybe we should not add the reuse feature for operations other than read? Read's the only operation where we expect a lot of very short requests coming in - not much benefit for writes, etc, plus they're more complicated.

          bc Wong added a comment -

          Thanks for the review, Todd. I'll add more tests, and look into making a cache of size > 1.

          I think there is a concurrency issue here. Namely, the positional read API calls through into fetchBlockByteRange, which will use the existing cached socket, regardless of other concurrent operations. So we may end up with multiple block readers on the same socket and everything will fall apart.

          That should be fine. Each SocketCacheEntry has a unique Socket, owned by its BlockReader. One of the reuse conditions is that the BlockReader has finished reading on that Socket (hasConsumedAll()). Note that we do not reuse BlockReaders. So at this point, it should be safe to take the Socket away from its previous owner and give it to a new BlockReader.

          I'll add tests for this though.

          Todd Lipcon added a comment -

          So at this point, it should be safe to take the Socket away from its previous owner and give it to a new BlockReader.

          I don't think this is true. Two issues:

          • At DFSInputStream.java:699, we check the cached socket for usability. If we have two threads concurrently at this line, they can both detect that it's usable, and then both move on to construct block readers on the same socket. One of them will win the entry.setOwner race at line 723, but they'll both still be operating on the same socket and generate bad commands. (A compressed sketch of this race follows the list.)
          • If a block reader exists and is not complete, usable() returns false so the socket won't be reused. But on line 701, it calls entry.close() which calls through to close the underlying socket. This will destroy the read currently going on.
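
           The first issue in compressed form (an illustrative, self-contained sketch with made-up types mirroring the patch's naming, not the patch's code):

             import java.net.Socket;

             // Check-then-act race: both threads can pass usable() before either
             // claims the socket, so two readers end up sharing one stream.
             class RaceSketch {
               static class SocketCacheEntry {
                 Socket sock;
                 volatile Object owner;
                 boolean usable() { return owner == null; }  // simplified check
                 void setOwner(Object o) { owner = o; }      // not atomic with usable()
               }

               static void startRead(SocketCacheEntry entry) {
                 if (entry != null && entry.usable()) {      // T1 and T2 both see true
                   Object reader = new Object();             // two readers are built;
                   entry.setOwner(reader);                   // one wins the race, but both
                   // now issue commands on entry.sock and corrupt the stream
                 }
               }
             }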
          bc Wong added a comment -

          I replaced the size-of-one cache with a more generic cache, which is also a global shared cache. There is a new TestParallelRead, which tests concurrent use of a DFSInputStream by multiple readers. There's a clear speed difference with vs. without the patch. Each thread does 1024 reads.

          Trunk:

          Report: 4 threads read 236953 KB (across 1 file(s)) in 5.879s; average 40304.98384078925 KB/s
          Report: 4 threads read 238873 KB (across 1 file(s)) in 5.063s; average 47180.13035749556 KB/s
          Report: 4 threads read 236068 KB (across 1 file(s)) in 5.93s; average 39809.10623946037 KB/s
          Report: 16 threads read 942666 KB (across 1 file(s)) in 13.524s; average 69703.19432120674 KB/s
          Report: 16 threads read 947015 KB (across 1 file(s)) in 13.401s; average 70667.48750093277 KB/s
          Report: 16 threads read 948768 KB (across 1 file(s)) in 12.932s; average 73365.91401175379 KB/s
          Report: 8 threads read 469529 KB (across 2 file(s)) in 5.436s; average 86373.98822663723 KB/s
          Report: 8 threads read 455428 KB (across 2 file(s)) in 5.363s; average 84920.38038411336 KB/s
          Report: 8 threads read 469005 KB (across 2 file(s)) in 5.713s; average 82094.34622790127 KB/s
          

          Patched:

          Report: 4 threads read 236845 KB (across 1 file(s)) in 3.612s; average 65571.70542635658 KB/s
          Report: 4 threads read 238803 KB (across 1 file(s)) in 4.371s; average 54633.49347975291 KB/s
          Report: 4 threads read 240241 KB (across 1 file(s)) in 4.395s; average 54662.34357224119 KB/s
          Report: 16 threads read 938652 KB (across 1 file(s)) in 9.044s; average 103787.26227333037 KB/s
          Report: 16 threads read 943999 KB (across 1 file(s)) in 8.59s; average 109895.11059371362 KB/s
          Report: 16 threads read 938546 KB (across 1 file(s)) in 9.081s; average 103352.71445876005 KB/s
          Report: 8 threads read 478534 KB (across 2 file(s)) in 3.376s; average 141745.85308056872 KB/s
          Report: 8 threads read 467412 KB (across 2 file(s)) in 3.623s; average 129012.42064587357 KB/s
          Report: 8 threads read 475349 KB (across 2 file(s)) in 3.49s; average 136203.15186246418 KB/s
          

          The edits to the docs in DataNode.java are good - if possible they should probably move into HDFS-1001 though, no?

          The addition to the docs doesn't apply to HDFS-1001, in which the DataXceiver still actively closes all sockets after each use.

          Todd, the new patch addresses the rest of your comments.

          Todd Lipcon added a comment -

          Great benchmark results, especially considering this is pseudodistributed, so there isn't much latency between client and DN. I imagine across an actual network it will be even more dramatic.

          I'll take a look at your patch later this week.

          bc Wong added a comment -

          Rebased HDFS-941-3.patch:

          • Fixed spurious exception
          • Closed sockets as soon as we realize they're bad.
          bc Wong added a comment -

          Oops. The previous patch was in the reverse direction.

          Eli Collins added a comment -

          Hey bc,

          Nice change!

          Do you have any results from a non-random workload? Please collect:

          1. before/after TestDFSIO runs so we can see if sequential throughput is affected
          2. hadoop fs -put of a 1g file from n clients in parallel. I suspect this will improve; socket reuse should limit slow start, but good to check.

          How did you choose DEFAULT_CACHE_SIZE?

          In the exception handler in sendReadResult, can we be more specific about when it's OK not to be able to send the result, and throw an exception in the cases when it's not OK, rather than swallowing all IOExceptions?

          In DataXceiver#opReadBlock you throw an IOException in a try block that catches IOException. I think that should LOG.error and close the output stream. You can also chain the following if statements that check stat.

          How about asserting sock != null in putCachedSocket? Seems like this should never happen if the code is correct and it's easy to ignore logs.

          File a jira for ERROR_CHECKSUM?

          Please add a comment to the head of ReaderSocketCache explaining why we cache BlockReader socket pairs, as opposed to just caching sockets (because we don't multiplex BlockReaders over a single socket between hosts).

          Nits:

          • Nice comment in the BlockReader header, please define "packet" as well. Is the RPC specification in DataNode outdated? If so fix it or file a jira instead of warning readers it may be outdated.
          • Maybe better name for DN_KEEPALIVE_TIMEOUT since there is no explicit keepalive? TRANSFER_TIMEOUT?
          • Would rename workDone to something specific like opsProcessed or make it a boolean
          • Add an "a" in "with checksum"
          • if statements need braces, eg BlockReader#read

          Thanks,
          Eli

          Eli Collins added a comment -

          hadoop fs -put of a 1g file from n clients in parallel. I suspect this will improve; socket reuse should limit slow start, but good to check.

          Meant fs -get here since we're caching sockets on reads and not writes. I think the DFSInputStream currently creates a new socket for each block it fetches.

          bc Wong added a comment -

          The variance is large on these tests, but they show that the patch isn't slower than trunk. Tests executed on a 5-node cluster:

          • TestDFSIO -read -fileSize 512 -bufferSize 4096 -nrFiles 10

                                 trunk   patched
              Num trials         6       5
              Throughput (MB/s)  92      93
              Avg IO (MB/s)      150     134
              Std dev            122     77

          • TestDFSIO -read -fileSize 512 -bufferSize 4096 -nrFiles 20

                                 trunk   patched
              Num trials         5       5
              Throughput (MB/s)  78      83
              Avg IO (MB/s)      114     121
              Std dev            75      76

          • Distributed bin/hadoop fs -cat /benchmarks/TestDFSIO/io_data/test_io_$i > /dev/null, for i in [0,9]

                                 trunk   patched
              Num trials         5       5
              Avg time (sec)     47.8    48.0
              Std dev            4.2     3.6
          bc Wong added a comment -

          (Eli) Maybe better name for DN_KEEPALIVE_TIMEOUT since there is no explicit keepalive? TRANSFER_TIMEOUT?

          The keepalive is not explicit, but that's essentially what the DN is doing, and I find it more descriptive. TRANSFER_TIMEOUT sounds like a timeout in the middle of a transfer.

          New -4 patch per Eli's comments.

          sam rash added a comment -

          +1 for the idea of caching sockets, but I have some questions/concerns about the implementation.
          some comments:

          1. avoid making the cache implementation tied to the class ReaderSocketCache. Don't make the cache a static member of the same class. Let the cache be an instantiable object. Let DFSClient store the cache either as an instance or static var (don't force everything to use the same cache instance--better for testing and stubbing out as well)
          2. a lot of the logic around re-using is complicated--I think this could be simplified
             a. not clear why sockets are always in the cache even if not usable: i would think adding only when usable and removing when used would be cleaner?
             b. if we can keep the cache clean, no need for lazy removal of unusable sockets
          3. shouldn't there be a cap on the # of sockets there can be in the cache? Again, it should hold only usable ones, but a max # put into the cache makes sense. If we have a flurry of reads using tons of sockets to several DNs, no need to keep 100s or more sockets in a cache
          4. general concern about potential socket leaks
          5. seems like this needs more thought into the effects of synchronization: the freemap has to be traversed every time to get a socket in a sync block. See above if we can avoid lazy removal by not putting unusable sockets in the cache (unusable either because they are in use or because they are not usable at all)
          6. do we have real performance benchmarks from actual clusters that show a significant benefit? as noted above, the change is fairly complex (caching is in fact hard), and if we don't see a substantial performance improvement, the risk of bugs may outweigh the benefit

          that's my 2c anyway

          -sr

          bc Wong added a comment -

          Thanks for looking at the patch, Sam.

          avoid making the cache implementation tied to the class ReaderSocketCache.

          Do you mean separating out a more generic cache class? I've thought about that quite a bit. But in the end, the socket cache I need is too tied to the BlockReader (i.e. the concept of "owner"), because the BlockReader doesn't multiplex on the same socket. Detaching the BlockReader from the CacheEntry could mean having an extra interface, which seems unnecessary until another usage scenario arrives. We can factor it out later when needed, since the caching implementation shouldn't affect the API/protocol.

          Don't make the cache a static member of the same class. Let the cache be an instantiable object. Let DFSClient store the cache either as an instance or static var

          Sounds good. I'll store it in the DFSClient.

          not clear why sockets are always in the cache even if not usable

          Only usable sockets are put into the cache. However, sockets can become unusable while they sit in the cache. The connection could drop. The DN side closes it after a short wait. So when a client wants to retrieve a socket, it might be bad. Here, we could:

          1. Handle that from within the cache and do our best to return a good socket to the client, or
          2. Let the client check the socket and retry.

          I went for (1) because it's less waste. Sorry about the misunderstanding. I'll put in more comments.
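
           A minimal sketch of option (1), assuming a multimap of idle sockets keyed by datanode address (the class shape and names here are hypothetical):

             import java.net.Socket;
             import java.net.SocketAddress;
             import java.util.Iterator;
             import com.google.common.collect.LinkedListMultimap;
             import org.apache.hadoop.io.IOUtils;

             // The cache skips and closes sockets that went stale while idle,
             // doing its best to return a good one to the caller.
             class FreeSocketMap {
               private final LinkedListMultimap<SocketAddress, Socket> freeMap =
                   LinkedListMultimap.create();

               synchronized Socket getSocket(SocketAddress dn) {
                 Iterator<Socket> it = freeMap.get(dn).iterator(); // idle sockets for dn
                 while (it.hasNext()) {
                   Socket s = it.next();
                   it.remove();
                   if (s.isConnected() && !s.isClosed()) {
                     return s;             // best effort: may still fail on first use
                   }
                   IOUtils.closeSocket(s); // went bad while cached; discard it
                 }
                 return null;              // cache miss: caller connects fresh
               }
             }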

          shouldn't there be a cap on the # of sockets there can be in the cache?

          There is. The code checks against cacheLimit when adding sockets to the free map.

          the freemap has to be traversed every time to get a socket in a sync block.

          True. I think it only sounds bad (but actually isn't). A reader has to read a lot of blocks simultaneously to see contention. First, the cacheLimit should likely be small (defaulting to 16) for performance reasons, to avoid hitting too many unusable sockets. Second, it only traverses a portion of the freeMap: the list of sockets to the same target IP. Plus, I kind of want it to check for unusable sockets before returning to the caller. I feel that finer locking granularity here could actually make it slower.

          do we have real performance benchmarks from actual clusters that show a significant benefit?

          I'll ask Todd to provide more details. He was testing a backported version of the patch with HBase. Preliminary results showed a 20% gain.

          Todd Lipcon added a comment -

          I ran some benchmarks again tonight using YCSB.

          I loaded 1M rows into an HBase table (untimed) on my test cluster. The cluster is running a 5-node HDFS, but I only ran one HBase region server, so that I could reliably have the same region deployment between test runs. The data fits entirely within the buffer cache, so we're just benchmarking DFS overhead and not actual seek time.

          I ran benchmarks with:

          java -cp build/ycsb.jar:src/com/yahoo/ycsb/db/hbaselib/*:$HBASE_CONF_DIR com.yahoo.ycsb.Client  -db com.yahoo.ycsb.db.HBaseClient -threads 40 -t -p columnfamily=test -P workloads/workloadc -p recordcount=$[1000*1000] -p operationcount=$[1000*1000]
          

          from one of the nodes in the cluster (not the same one that ran the region server)

          I ran the benchmark twice without the patch and twice with, alternating builds and restarting DFS and HBase each time, to make sure I wasn't getting any variability due to caching, etc.

          Results follow:

          ==> 941-bench-1.txt <==
          YCSB Client 0.1
          Command line: -db com.yahoo.ycsb.db.HBaseClient -threads 40 -t -p columnfamily=test -P workloads/workloadc -p recordcount=1000000 -p operationcount=1000000
          [OVERALL],RunTime(ms), 118197
          [OVERALL],Throughput(ops/sec), 8460.451618907417
          [READ], Operations, 1000000
          [READ], AverageLatency(ms), 4.701651
          [READ], MinLatency(ms), 0
          [READ], MaxLatency(ms), 1352
          [READ], 95thPercentileLatency(ms), 11
          [READ], 99thPercentileLatency(ms), 15

          ==> 941-bench-2.txt <==
          YCSB Client 0.1
          Command line: -db com.yahoo.ycsb.db.HBaseClient -threads 40 -t -p columnfamily=test -P workloads/workloadc -p recordcount=1000000 -p operationcount=1000000
          [OVERALL],RunTime(ms), 124005
          [OVERALL],Throughput(ops/sec), 8064.190960041934
          [READ], Operations, 1000000
          [READ], AverageLatency(ms), 4.940652
          [READ], MinLatency(ms), 0
          [READ], MaxLatency(ms), 1337
          [READ], 95thPercentileLatency(ms), 12
          [READ], 99thPercentileLatency(ms), 16

          ==> normal-bench-1.txt <==
          YCSB Client 0.1
          Command line: -db com.yahoo.ycsb.db.HBaseClient -threads 40 -t -p columnfamily=test -P workloads/workloadc -p recordcount=1000000 -p operationcount=1000000
          [OVERALL],RunTime(ms), 182316
          [OVERALL],Throughput(ops/sec), 5484.982118958293
          [READ], Operations, 1000000
          [READ], AverageLatency(ms), 7.267306
          [READ], MinLatency(ms), 0
          [READ], MaxLatency(ms), 1327
          [READ], 95thPercentileLatency(ms), 17
          [READ], 99thPercentileLatency(ms), 26

          ==> normal-bench-2.txt <==
          YCSB Client 0.1
          Command line: -db com.yahoo.ycsb.db.HBaseClient -threads 40 -t -p columnfamily=test -P workloads/workloadc -p recordcount=1000000 -p operationcount=1000000
          [OVERALL],RunTime(ms), 190053
          [OVERALL],Throughput(ops/sec), 5261.690160113231
          [READ], Operations, 1000000
          [READ], AverageLatency(ms), 7.577673
          [READ], MinLatency(ms), 0
          [READ], MaxLatency(ms), 1525
          [READ], 95thPercentileLatency(ms), 15
          [READ], 99thPercentileLatency(ms), 21

          In other words, this patch speeds up average latency by nearly 40%, with similar gains on the high percentile latencies. The reads/sec number improved by about 35%.

          This is without any tuning of the keepalive or the socket cache size - I imagine even more improvement could be made with a bit more tuning, etc.

          Todd Lipcon added a comment -

          I'd like to hold off on this just a bit longer yet - I'm seeing this sporadically in my testing:

          Caused by: java.lang.IndexOutOfBoundsException
          at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:151)
          at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1155)
          at org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker.java:384)
          at org.apache.hadoop.hdfs.DFSClient$BlockReader.readAll(DFSClient.java:1441)
          at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.fetchBlockByteRange(DFSClient.java:1913)
          at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2035)
          at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:46)

          But the above benchmarks do show that the idea has a lot of promise! (and the above trace may in fact be an HBase bug)

          sam rash added a comment -

          todd: wow, those benchmarks do look impressive! do we have any idea if standard sequential access gets any benefit?

          bc: my point about the cache is you don't have to hard-code it as a static member of ReaderSocketCache. I don't think it needs to be more generic--it can be a socket cache. I do think it can be decoupled from BlockReader by getting rid of the owner.

          why does a 'cache' create sockets? you can avoid the whole owner problem if you simply let the client ask for a socket, and if there is none, create its own, use it, and put it in the cache when it's done with it (ie, it's usable). This should greatly reduce complexity (no need for free + used separately, owner, etc). It seems like this is mixing up the responsibilities of a socket factory and a socket cache (possibly why it seems complex to me)

            boolean reusable() {
              return ((owner == null || owner.hasConsumedAll()) &&
                      sock.isConnected() &&
                      !sock.isInputShutdown() && !sock.isOutputShutdown());
            }

          If you can make this change, reusable() will only have to check the socket itself.
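
           For illustration, the take/use/put-back protocol described above might look like this (SocketCache, take, put, and isReusable are hypothetical names, not the patch's API):

             import java.io.IOException;
             import java.net.InetSocketAddress;
             import java.net.Socket;

             // The cache holds nothing but idle, reusable sockets, so there is no
             // owner tracking and no separate "used" map to leak from.
             class TakeUseReturn {
               interface SocketCache {
                 Socket take(InetSocketAddress addr);           // remove-on-get
                 void put(InetSocketAddress addr, Socket sock);
               }

               void readOnce(SocketCache cache, InetSocketAddress dnAddr) throws IOException {
                 Socket s = cache.take(dnAddr);
                 if (s == null) {
                   s = new Socket();                            // miss: caller creates its own
                   s.connect(dnAddr, 3000);
                 }
                 boolean clean = false;
                 try {
                   readBlock(s);                                // use it for one operation
                   clean = isReusable(s);
                 } finally {
                   if (clean) {
                     cache.put(dnAddr, s);                      // only clean sockets go back in
                   } else {
                     s.close();
                   }
                 }
               }

               private void readBlock(Socket s) throws IOException { /* placeholder */ }

               private boolean isReusable(Socket s) {
                 return s.isConnected() && !s.isInputShutdown() && !s.isOutputShutdown();
               }
             }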

          sam rash added a comment -

          two other comments:

          1. the number of sockets per address is limited, but not the number of addresses. This may in practice not be a problem, but the cache can in theory grow very large.
          2. the usedMap seems like a good place for a memory/object leak: I can take a socket and never return it (again, this is why I vote for getting rid of this data structure entirely; as far as a cache is concerned, an entry that someone else owns shouldn't even be there). Otherwise, you've got to periodically clean this map up as well. It seems to be used only for stats, which I think you can do without actually keeping a hash of used sockets.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12443322/HDFS-941-4.patch
          against trunk revision 1051669.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 15 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/31//console

          This message is automatically generated.

          bc Wong added a comment -

          Apologies for having neglected this patch. I look forward to some quiet focus time over the next month to rebase it and to implement Sam's comments.

          sam rash added a comment -

          btw, this is really great as a cache that handles a max size, time-based expiration, and lazy creation of values based on keys (computed only once):

          http://guava-libraries.googlecode.com/svn/trunk/javadoc/com/google/common/collect/MapMaker.html

          see the 'computeMap' option. It might simplify this code
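
           A hedged sketch of what that might look like (the MapMaker API changed across Guava versions, so treat the exact method names as approximate; SocketPool is a made-up type):

             import java.net.InetSocketAddress;
             import java.util.concurrent.ConcurrentMap;
             import java.util.concurrent.TimeUnit;
             import com.google.common.base.Function;
             import com.google.common.collect.MapMaker;

             class MapMakerSketch {
               // One lazily created pool per datanode, with time-based expiration.
               private final ConcurrentMap<InetSocketAddress, SocketPool> pools =
                   new MapMaker()
                       .expiration(30, TimeUnit.SECONDS)  // time-based eviction
                       .makeComputingMap(new Function<InetSocketAddress, SocketPool>() {
                         public SocketPool apply(InetSocketAddress dn) {
                           return new SocketPool(dn);     // computed at most once per key
                         }
                       });

               static class SocketPool {
                 SocketPool(InetSocketAddress dn) { /* placeholder */ }
               }
             }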

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12443322/HDFS-941-4.patch
          against trunk revision 1072023.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 15 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/199//console

          This message is automatically generated.

          bc Wong added a comment -

          Greatly simplified the socket cache.

          • Sam, thanks for the MapMaker pointer. It looks really useful, but I need a multimap, which MapMaker doesn't support. (I did find a multimap library in the same place, though.)
          • The socket cache also needs to close the sockets when it evicts them, which makes it socket-specific, unfortunately (see the eviction sketch after this list).
          • I'm aware of the giant lock in the cache. What I don't know is whether there will be high contention. I assume that most of the time clients are doing I/O or processing the data they read, so the synchronization overhead here is relatively small. We could do finer-grained locking, but I don't want to optimize something that's not gonna be a bottleneck.
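
           For illustration, the eviction half of such a cache might look like this (the capacity value and class shape are assumptions; only the use of a LinkedListMultimap and the close-on-evict behavior come from the comment above):

             import java.io.IOException;
             import java.net.Socket;
             import java.net.SocketAddress;
             import java.util.Iterator;
             import java.util.Map;
             import com.google.common.collect.LinkedListMultimap;

             class EvictingSocketCache {
               private final int capacity = 16;  // assumed cap, per the earlier discussion
               private final LinkedListMultimap<SocketAddress, Socket> multimap =
                   LinkedListMultimap.create();

               public synchronized void put(SocketAddress dn, Socket sock) throws IOException {
                 if (multimap.size() >= capacity) {
                   // Evict the oldest idle socket overall. Unlike a generic cache, the
                   // evicted value must be closed, which ties this cache to sockets.
                   Iterator<Map.Entry<SocketAddress, Socket>> it =
                       multimap.entries().iterator();
                   Socket evicted = it.next().getValue();
                   it.remove();
                   evicted.close();
                 }
                 multimap.put(dn, sock);
               }
             }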
          Kihwal Lee added a comment -

          Nice work! I performed a basic test and got results comparable to those from your previous patch. I will review the patch in depth soon.

          Kihwal Lee added a comment -

          +1 The patch looks good. I was unsure about the new dependency on Guava, but apparently people have already agreed on adding it to hadoop-common, so I guess it's not an issue.

          Kihwal Lee added a comment -

          This is from my own pread test. Local/Remote preads benefit from this patch if the content is cached in the page cache.

          stack added a comment -

          +1 on commit. Patch looks great, though a bit hard to read because it's mostly white-space changes. I like the tests. I'm good w/ adding guava.

          If a v6, here are a few minor comments:

          Javadoc on BlockReader is not properly formatted (it will show as a mess after html'ing) – same for the class comment on DN.

          gotEOS is an odd name for a boolean; would think eos better?

          Hard-codings like this, + final int MAX_RETRIES = 3;, should instead be gotten from config, even if not declared in hdfs-default.xml. Same for DN_KEEPALIVE_TIMEOUT.
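
           For instance (the key name and default below are made up for illustration, not what the patch settled on):

             import org.apache.hadoop.conf.Configuration;

             // Pull the keepalive from config instead of hard-coding it.
             Configuration conf = new Configuration();
             int keepaliveTimeoutMs =
                 conf.getInt("dfs.datanode.socket.reuse.keepalive", 1000); // millis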

          Why would we retry a socket that is throwing an IOE? Why not close and move on with a new socket?

          Is SocketCache missing a copyright notice?

          Is this the right thing to do?

          +    SocketAddress remoteAddr = sock.getRemoteSocketAddress();
          +    if (remoteAddr == null) {
          +      return;
          +    }
          

          The socket is not cached because it does not have a remote address. Why does it not have a remote address? Is there something wrong w/ the socket? Should we throw an exception, or close and throw away the socket?

          There is a tab at #1242 in patch:

          +	// restore normal timeout
          bc Wong added a comment -

          Thanks for the feedback, Stack. The v6 patch:

          • Undid the whitespace changes. My IDE was misconfigured.
          • Turned hardcoded values into config keys.
          • Fixed javadoc on BlockReader.
          • Renamed gotEOS to eos.
          • Handled null remoteAddr by closing the socket.

          Did not work on:

          • DN javadoc. I don't think I touched it.
          • SocketCache copyright. It's there, I think.

          I don't follow this (not sure what you're referring to):

          Why would we retry a socket that is throwing an IOE?

          stack added a comment -

          On 'Why would we retry a socket that is throwing an IOE?', I went through your patch again and cannot figure out what I was referring to. Please excuse me.

          On v6, I took a quick look through and it looks good to me, +1 (Nice javadoc on BlockReader and elsewhere). Want to try submitting it to hudson?

          bc Wong added a comment -

          Submitting a `HDFS-941-6.patch' attachment again to kick Hudson.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12476021/HDFS-941-6.patch
          against trunk revision 1091131.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 15 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/340//console

          This message is automatically generated.

          Tsz Wo Nicholas Sze added a comment -

          Hi bc, it seems that Jenkins (previously Hudson) sometimes does not pick up patches. I have just submitted this manually.

          bc Wong added a comment -

          New patch file with --no-prefix.

          bc Wong added a comment -

          Thanks Nicholas! I generated the wrong patch format, unfortunately. Could you help me submit it to Jenkins again?

          Tsz Wo Nicholas Sze added a comment -

          You are welcome. I have just started it.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12476027/HDFS-941-6.patch
          against trunk revision 1091131.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 15 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.hdfs.server.namenode.metrics.TestNameNodeMetrics
          org.apache.hadoop.hdfs.TestFileConcurrentReader

          -1 contrib tests. The patch failed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/343//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/343//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/343//console

          This message is automatically generated.

          bc Wong added a comment -

          I'll take a look at the TestFileConcurrentReader failure.

          Todd Lipcon added a comment -

          TestFileConcurrentReader has been failing intermittently a lot for a while - it's likely this isn't related to the patch. But worth a quick look at least to see if this patch changes the intermittent failure to a reproducible one.

          sam rash added a comment -

          The last failure I saw with this test was basically unrelated to the test itself--it was a socket leak in the datanode, I think with RPCs.

          I glanced at the first test failure output and found a similar error:

          2011-04-11 21:29:36,962 INFO datanode.DataNode (DataXceiver.java:opWriteBlock(458)) - writeBlock blk_-6878114854540472276_1001 received exception java.io.FileNotFoundException: /grid/0/hudson/hudson-slave/workspace/PreCommit-HDFS-Build/trunk/build/test/data/dfs/data/data1/current/rbw/blk_-6878114854540472276_1001.meta (Too many open files)

          Note that this test implicitly finds any socket/fd leaks because it opens/closes files repeatedly.

          If you can check into this, that'd be great. I'll have some more time later this week to help more.

          Todd Lipcon added a comment -

          Looks pretty good, and I looped TestFileConcurrentReader for half an hour or so with no failures.

          A few small comments:

          • "google-collections" is deprecated in favor of the new name "guava" - we should depend on the newest
          • in TestParallelRead, you have a few cases of assert() where you should probably be using assertEquals() in case unit tests run without -ea. assertEquals() will also give a nicer error message
          • in SocketCache.evict(), you are calling multimap.remove while iterating over the same map's entries. This seems likely to throw ConcurrentModificationException. Better to use multimap.iterator() and call it.remove() (see the sketch after this list). This also makes me notice that you only ever call evict() with an argument of 1, so maybe you should just rename it to evictOne()
          • If you have multiple DFSClient in a JVM with different socketTimeout settings, I think this will currently end up leaking timeouts between them. Perhaps after successfully getting a socket from socketCache, you need to call sock.setSoTimeout based on the current instance of dfsClient?
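
          To make the ConcurrentModificationException point concrete, here is a minimal sketch of iterator-based eviction over a Guava multimap; the class shape and names are assumptions for illustration, not the patch's actual SocketCache:

          // Sketch: evict the eldest cached socket by removing through the
          // entries() iterator rather than calling multimap.remove() mid-iteration.
          import com.google.common.collect.LinkedListMultimap;
          import java.io.IOException;
          import java.net.Socket;
          import java.net.SocketAddress;
          import java.util.Iterator;
          import java.util.Map.Entry;

          class SocketCacheSketch {
            private final LinkedListMultimap<SocketAddress, Socket> multimap =
                LinkedListMultimap.create();

            synchronized void evictOne() {
              Iterator<Entry<SocketAddress, Socket>> it = multimap.entries().iterator();
              if (it.hasNext()) {
                Entry<SocketAddress, Socket> eldest = it.next();
                it.remove();  // safe: removal goes through the iterator
                try {
                  eldest.getValue().close();
                } catch (IOException ignored) {
                  // best-effort close of the evicted socket
                }
              }
            }
          }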
          Jason Rutherglen added a comment -

          I'm seeing many errors trying to apply http://issues.apache.org/jira/secure/attachment/12476027/HDFS-941-6.patch to https://svn.apache.org/repos/asf/hadoop/hdfs/trunk

          Todd Lipcon added a comment -

          Attached patch rebased against trunk and improved in the following ways:

          • moved SocketCache instance to DFSClient instead of being static
          • fixed bugs that could have caused concurrent modification exceptions in SocketCache itself – now uses iterator remove methods in eviction code.
          • fixed SocketCache.size() to be synchronized
          • renamed BlockSender.blockReadFully to sentEntireByteRange, and made it get set to true as soon as it has sent the entire requested length. This was necessary so that the client and server agree on when a status code will be expected.
          • DataXceiver: renamed the sockReuseTimeout to socketKeepaliveTimeout - I think this is a slightly clearer name
          • Fixed some assertions in new tests to use junit assertions instead of Java assertions (as suggested in an above comment)
          • changed TestParallelRead to disable the clienttrace log, since jstack was showing that it was causing a ton of contention
          • couple of misc style cleanups

          I also ran TestParallelRead before and after the patch, after bumping the N_ITERATIONS to 10240 and changing the proportion of non-positional reads to 0. The results are:

          without patch:
          11/06/05 20:32:54 INFO hdfs.TestParallelRead: === Report: 4 threads read 2619994 KB (across 1 file(s)) in 25.762s; average 101699.94565639313 KB/s
          11/06/05 20:33:34 INFO hdfs.TestParallelRead: === Report: 16 threads read 10470506 KB (across 1 file(s)) in 40.583s; average 258002.26695907154 KB/s
          11/06/05 20:34:00 INFO hdfs.TestParallelRead: === Report: 8 threads read 5232371 KB (across 2 file(s)) in 25.484s; average 205319.8477476063 KB/s

          with patch:
          11/06/05 20:35:45 INFO hdfs.TestParallelRead: === Report: 4 threads read 2626843 KB (across 1 file(s)) in 10.208s; average 257331.7985893417 KB/s
          11/06/05 20:36:13 INFO hdfs.TestParallelRead: === Report: 16 threads read 10492178 KB (across 1 file(s)) in 27.046s; average 387938.25334615103 KB/s
          11/06/05 20:36:25 INFO hdfs.TestParallelRead: === Report: 8 threads read 5236253

          Todd Lipcon added a comment -

          oops, the last line of my benchmark results got truncated. It should read:

          without patch:
          11/06/05 20:32:54 INFO hdfs.TestParallelRead: === Report: 4 threads read 2619994 KB (across 1 file(s)) in 25.762s; average 101699.94565639313 KB/s
          11/06/05 20:33:34 INFO hdfs.TestParallelRead: === Report: 16 threads read 10470506 KB (across 1 file(s)) in 40.583s; average 258002.26695907154 KB/s
          11/06/05 20:34:00 INFO hdfs.TestParallelRead: === Report: 8 threads read 5232371 KB (across 2 file(s)) in 25.484s; average 205319.8477476063 KB/s

          with patch:
          11/06/05 20:35:45 INFO hdfs.TestParallelRead: === Report: 4 threads read 2626843 KB (across 1 file(s)) in 10.208s; average 257331.7985893417 KB/s
          11/06/05 20:36:13 INFO hdfs.TestParallelRead: === Report: 16 threads read 10492178 KB (across 1 file(s)) in 27.046s; average 387938.25334615103 KB/s
          11/06/05 20:36:25 INFO hdfs.TestParallelRead: === Report: 8 threads read 5236253 KB (across 2 file(s)) in 12.447s; average 420683.93990519806 KB/s

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12481534/hdfs-941.txt
          against trunk revision 1131331.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 12 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.hdfs.server.namenode.metrics.TestNameNodeMetrics
          org.apache.hadoop.hdfs.TestDFSClientRetries

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/711//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/711//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/711//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          Updated patch to fix failing tests:

          • TestDFSClientRetries works by setting the max xceiver count to something very low like 2, and then hammering it with a lot of clients, to make sure the randomized backoff lets them all eventually succeed. With even a short keepalive on the datanode side, the transceivers were occupied for too long. Set the DN keepalive config to 0 for this test case, and modified the DN code so that a setting of 0 disables the behavior (see the sketch after this list).
          • TestNameNodeMetrics was looking at the cluster "load" (read: xceiver count) as one of the metrics. This was therefore sensitive to timing since it depended on whether the DN heartbeated during the keepalive window or after it had expired. I removed this assert since the other metrics already provide good coverage.
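
          A hedged sketch of the shape of that DN-side change (the real logic lives in the DataXceiver loop; names and structure here are assumptions):

          class KeepaliveSketch {
            // A keepalive config of 0 disables connection reuse entirely,
            // restoring the old one-op-per-connection behavior.
            static boolean prepareForNextOp(java.net.Socket sock, int keepaliveTimeoutMs)
                throws java.io.IOException {
              if (keepaliveTimeoutMs > 0) {
                sock.setSoTimeout(keepaliveTimeoutMs); // wait briefly for the next op
                return true;                           // loop around for another request
              }
              sock.close();                            // 0 means: close after each op
              return false;
            }
          }
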
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12481594/hdfs-941.txt
          against trunk revision 1132698.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 18 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/720//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/720//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/720//console

          This message is automatically generated.

          stack added a comment -

          I took a look at the patch. It looks good to me. Minor comments below. Meantime I've patched it into a hadoop 0.22 and am running a loading on it overnight to see if I can find probs.

          What is this about?

          + <dependency org="com.google.collections" name="google-collections" rev="${google-collections.version}" conf="common->default"/>

          When I go to the google-collections home page it says:

          This library was renamed to Guava!
          What you see here is ancient and unmaintained. Do not use it.
          

          Nice doc. changes in BlockReader.

          If you make another version of this patch, change the mentions of getEOS in comments to be 'eos' to match the change of variable name.

          When you create a socket inside getBlockReader, you've added this:

           469 +        sock.setTcpNoDelay(true);    
          

          to the socket config before connect. That is intentional? (This is new with this patch. Also, old code used to set the timeout after making the connection – which seems off... in your patch you set the timeout then connect.)

          You think 16 a good number for the socket cache (doesn't seem easily changeable)?

          Nice cleanup of description in DataNode.java

          One note is that this patch looks 'safe'; we default to closing the connection if anything untoward happens, which should be just the behavior the DN had before this patch.

          TestParallelRead is sweet.

          Kihwal Lee added a comment -

          > You think 16 a good number for the socket cache (doesn't seem easily changeable)?
          If the client's working set of datanodes over the past several seconds is larger, it means lower locality. If a lot of clients are doing this, each datanode is likely to see less data locality, making the page cache less effective. This can make more reads cold, and the gain from caching connections will start to diminish. Is 16 a good number? IMO, it may actually be too big for typical use cases, but is small enough to not cause trouble.

          stack added a comment -

          +1 on commit. Have run this patch first in a "light" random read loading overnight and then over this morning with a 'heavy' random read + write loading on a 5-node cluster. Discernible perf improvement (caching involved so hard to say for sure, but I see a 20% improvement if just random reads).

          @Kihwal Fair enough.

          stack added a comment -

          Oh, just to say that I don't see hdfs-level complaints on the server or client side, and that I tested on patched 0.22 hadoop.

          stack added a comment -

          Here is a version of the patch that applies to 0.22; a couple of imports and constants were rejected.

          stack added a comment -

          This patch should be applied to hadoop 0.22. It'd be an incentive for hbase users to upgrade to hadoop 0.22.

          Todd Lipcon added a comment -

          Stack, thanks a million for the cluster testing and review!! I will get to your review feedback later this afternoon and post a final patch.

          stack added a comment -

          On occasion I see these new additions to the datanode log:

          2011-06-08 12:37:20,478 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Client did not send a valid status code after reading. Will close connection.
          2011-06-08 12:37:20,480 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Client did not send a valid status code after reading. Will close connection.
          2011-06-08 12:37:20,482 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Client did not send a valid status code after reading. Will close connection.
          2011-06-08 12:37:20,483 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Client did not send a valid status code after reading. Will close connection.
          

          Should these be logged as DEBUG and not ERROR?

          I see this too, don't think it related:

          2011-06-08 12:40:09,642 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-2049668997072761677_6556 src: /10.4.9.34:36343 dest: /10.4.9.34:10010
          2011-06-08 12:40:09,661 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: BlockSender.sendChunks() exception: java.io.IOException: Connection reset by peer
                  at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
                  at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:415)
                  at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:516)
                  at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:204)
                  at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:392)
                  at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:481)
                  at org.apache.hadoop.hdfs.server.datanode.DataXceiver.opReadBlock(DataXceiver.java:237)
                  at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.opReadBlock(DataTransferProtocol.java:356)
                  at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.processOp(DataTransferProtocol.java:328)
                  at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:169)
                  at java.lang.Thread.run(Thread.java:662)
          

          What's odd is that this is the machine talking to itself.

          Todd Lipcon added a comment -

          Hey Stack, are you sure you got the latest patch applied? The "did not send a valid status code" bit was changed to a WARN in the latest patch, and I also addressed a bug that would cause it to happen more often than it used to.

          I agree that the warning in sendChunks is unrelated - I've seen that in trunk for a while before this patch.

          stack added a comment -

          OK. Looks like I was running the just-previous one. Let me redo the loadings. On the IOE in sendChunks, this is in 0.22. Should I make an issue for it?

          stack added a comment -

          Or, hang on, let me indeed verify 0.22 has this minus the 941 patch.

          Kihwal Lee added a comment -

          I will try putting some load in a cluster with this patch + trunk.

          Todd Lipcon added a comment -

          New patch with response to stack's review:

          ... google-collections ...

          I think you were looking at a previous patch. The new patch uses guava, which has already been included in ivy.xml for a month or two.

          If you make another version of this patch, change the mentions of getEOS in comments to be 'eos' to match the change of variable name.

          fixed (only found one such mention)

          469 + sock.setTcpNoDelay(true);

          Sharp eyes. The reason we need TCP_NODELAY now whereas we didn't before is that, without it, we'll hit a nasty interaction between Nagle's algorithm and delayed ACKs. Basically, our interaction pattern now looks like:

          C -> DN: read block X
          DN -> C: data
          C -> DN: status OK
          C -> DN: read block Y

          The fact that we send two packets in a row triggers Nagle's algorithm - it won't send the read request for block Y until it's received an ACK for its "status OK" packet, or a lengthy delay elapses (40ms I think in Linux). The DN isn't sending the ACK for "status OK" because it's gone into delayed ACK mode. So, throughput drops by a factor of 20x or more. I confirmed this with TestParallelRead - I saw a 12MB/sec result vs the 274MB/sec with NODELAY on.

          NODELAY fixes this by forcing the "read block Y" request to be sent immediately even though the ACK for the prior packet is still outstanding.

          To be sure that the performance gains in this patch aren't just from NODELAY, I added NODELAY to unpatched trunk and ran TestParallelRead. It made no performance difference when it wasn't combined with connection reuse.

          I'll add a comment to the code explaining why the nodelay is crucial for performance.
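
          For illustration, a minimal sketch of the socket setup sequence being described (assumed helper shape, not the patch verbatim):

          // Sketch: NODELAY and the read timeout are both applied before connect().
          import java.io.IOException;
          import java.net.InetSocketAddress;
          import java.net.Socket;

          class ReaderSocketSketch {
            static Socket newReaderSocket(InetSocketAddress dnAddr, int timeoutMs)
                throws IOException {
              Socket sock = new Socket();
              // Disable Nagle so the "status OK" + next read request are sent
              // immediately instead of waiting on the DN's delayed ACK.
              sock.setTcpNoDelay(true);
              sock.setSoTimeout(timeoutMs);   // set the timeout, then connect
              sock.connect(dnAddr, timeoutMs);
              return sock;
            }
          }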

          Also, old code used to set the timeout after making the connection

          Fixed, I don't think this was for any particular reason.

          You think 16 a good number for the socket cache (doesn't seem easily changeable)?

          16 seems like a decent default, but I made it configurable just in case it bites anyone in production. Added dfs.client.socketcache.capacity as an undocumented parameter, since we don't expect it to be tuned.

          stack added a comment -

          New patch looks good (nice comment on why NODELAY). Let me test it.

          stack added a comment -

          Testing the just-previous version of the patch, I don't see 'Client did not send a valid status ...' anymore. I see the "BlockSender.sendChunks() exception: java.io.IOException: Connection reset by peer..." but that is purportedly present here in 0.22 branch absent this patch.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12481868/hdfs-941.txt
          against trunk revision 1133476.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 18 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestHDFSCLI

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/745//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/745//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/745//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          I'd like to commit this tomorrow so long as Stack and Kihwal's testing works out. Woo!

          stack added a comment -

          Dang. Did more testing (w/ Todd's help). I backported his patch to 0.22 so could run my loadings. I see this every so often in dn logs 'Got error for OP_READ_BLOCK' (perhaps once every ten minutes per server). The other side of the connection will print 'Client /10.4.9.34did not send a valid status code after reading. Will close connection' (I'll see this latter message much more frequently than the former but it seems fine – we are just closing the connection and moving on w/ no repercussions client-side).

          Here is more context.

          In the datanode log (Look for 'Client /10.4.9.34did not...'):

          2011-06-08 23:39:45,759 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block blk_-1043418802690508828_7206 of size 16207176 from /10.4.9.34:57333
          2011-06-08 23:39:45,759 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_-1043418802690508828_7206 terminating
          2011-06-08 23:39:45,960 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_5716868613634466961_7207 src: /10.4.14.34:39560 dest: /10.4.9.34:10010
          2011-06-08 23:39:46,301 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block blk_5716868613634466961_7207 of size 29893370 from /10.4.14.34:39560
          2011-06-08 23:39:46,301 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_5716868613634466961_7207 terminating
          2011-06-08 23:39:46,326 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-7242346463849737969_7208 src: /10.4.14.34:39564 dest: /10.4.9.34:10010
          2011-06-08 23:39:46,434 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Client /10.4.9.34did not send a valid status code after reading. Will close connection.
          2011-06-08 23:39:46,435 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Client /10.4.9.34did not send a valid status code after reading. Will close connection.
          2011-06-08 23:39:46,435 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Client /10.4.9.34did not send a valid status code after reading. Will close connection.
          2011-06-08 23:39:46,435 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Client /10.4.9.34did not send a valid status code after reading. Will close connection.
          2011-06-08 23:39:47,837 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block blk_-7242346463849737969_7208 of size 67108864 from /10.4.14.34:39564
          2011-06-08 23:39:47,837 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_-7242346463849737969_7208 terminating
          2011-06-08 23:39:47,855 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_7820819556875770048_7208 src: /10.4.14.34:39596 dest: /10.4.9.34:10010
          2011-06-08 23:39:49,212 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block blk_7820819556875770048_7208 of size 67108864 from /10.4.14.34:39596
          2011-06-08 23:39:49,212 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_7820819556875770048_7208 terminating
          

          In the regionserver log (the client):

          2011-06-08 23:39:45,777 INFO org.apache.hadoop.hbase.regionserver.Store: Completed compaction of 4 file(s) in values of usertable,user617882364,1307559813504.e4a9ed69f909762ddba8027cb6438575.; new storefile name=hdfs://sv4borg227:10000/hbase/usertable/e4a9ed69f909762ddba8027cb6438575/values/6552772398789018757, size=143.5m; total size for store is 488.4m
          2011-06-08 23:39:45,777 INFO org.apache.hadoop.hbase.regionserver.compactions.CompactionRequest: completed compaction: regionName=usertable,user617882364,1307559813504.e4a9ed69f909762ddba8027cb6438575., storeName=values, fileCount=4, fileSize=175.5m, priority=2, date=Wed Jun 08 23:39:41 PDT 2011; duration=3sec
          2011-06-08 23:39:45,777 DEBUG org.apache.hadoop.hbase.regionserver.compactions.CompactionRequest: CompactSplitThread Status: compaction_queue=(0:0), split_queue=0
          2011-06-08 23:39:46,436 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.4.9.34:10010 for file /hbase/usertable/e4a9ed69f909762ddba8027cb6438575/values/5422279471660943029 for block blk_1325488162553537841_6905:java.io.IOException: Got error for OP_READ_BLOCK, self=/10.4.9.34:57345, remote=/10.4.9.34:10010, for file /hbase/usertable/e4a9ed69f909762ddba8027cb6438575/values/5422279471660943029, for block 1325488162553537841_6905
              at org.apache.hadoop.hdfs.BlockReader.newBlockReader(BlockReader.java:437)
              at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:727)
              at org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:618)
              at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:781)
              at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:51)
              at org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:101)
              at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
              at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
              at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:122)
              at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1139)
              at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:1081)
          
          
          Todd Lipcon added a comment -

          Stack seems to have turned up some kind of "out of sync" issue between client and server, where the client tries to do another request when the server is still expecting a status message. So, no commit tomorrow
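          Roughly, the hand-off that has to happen on the client side before the connection is reusable looks like this (a sketch only; sentEntireByteRange, socketCache, and the status write are illustrative names, not the exact DFSClient code):

          // After consuming the last packet of the requested range. Only a
          // stream parked exactly at this boundary is safe to reuse; if the
          // client skips the status write, or stopped mid-range, the two
          // sides disagree about whose turn it is on the wire.
          if (sentEntireByteRange) {
            CHECKSUM_OK.write(out);         // the status message the DN waits for
            out.flush();
            socketCache.put(dnAddr, sock);  // connection is at an op boundary
          } else {
            IOUtils.closeSocket(sock);      // mid-stream: state is undefined
          }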

          Kihwal Lee added a comment -

          My test is still running on trunk, but so far I only see "did not send a valid status code after reading. Will close connection" on special occasions. In my case it happens during task init (the random readers are map tasks in my test), with the number of messages exactly matching the number of tasks running on the DN. Afterwards I don't see them.

          stack added a comment -

          @Kihwal Are you doing any writing at the same time? (I was).

          stack added a comment -

          @Kihwal You are on TRUNK and not 0.22? (I wonder if my backport messed up something – Todd doesn't think so but...)

          Kihwal Lee added a comment -

          It's read-only and yes it's against TRUNK. I put 200 X 170MB files across 8 DNs, dfs.replication=1. There are 200 random readers who are randomly reading from all 200 files. The locality was intentionally reduced to test the socket caching. I will try a R/W test once this one is done.

          Todd Lipcon added a comment -

          Figured out the bug. DFSInputStream.close() was putting the socket back in the cache even if it wasn't in a reusable state. I added an assert that catches this, and fixed close().

          Attached is just the delta to make it easier for stack to update. Full patch momentarily.
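          In shape, the fixed close() does something like the following (illustrative names only; see the attached delta for the real change):

          public synchronized void close() throws IOException {
            if (dnSock != null) {
              // Only donate the socket back to the cache if the reader
              // finished its byte range and acknowledged the DataNode,
              // i.e. the stream is parked at an op boundary.
              if (blockReader != null && blockReader.hasSentStatusCode()) {
                socketCache.put(dnSock);
              } else {
                IOUtils.closeSocket(dnSock);
              }
              dnSock = null;
            }
            // ...rest of close() unchanged...
          }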

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12481896/hdfs-941.txt
          against trunk revision 1133476.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 18 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestHDFSCLI
          org.apache.hadoop.hdfs.TestHDFSTrash

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/748//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/748//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/748//console

          This message is automatically generated.

          Kihwal Lee added a comment -

          Good catch and fix! I took a close look at the open connections each reader has and sometimes saw more than one connection to the same DN. I will see if that is fixed by Todd's fix. Otherwise I will look further to determine whether it is an issue.

          The test I did was primarily for exercising the socket cache itself. To make it more interesting, the socket cache size was lowered to 3 and dfs.replication to 1. I used the random read test (work in progress) in HDFS-236 on a cluster with 8 data nodes. 200 x 170MB files were created. 200 readers (25 on each DN) read the 200 files randomly, 64K at a time, jumping among files, for about 6 hours last night. Each reader caches a DFSInputStream for each of the 200 files during its lifetime. I checked the client/server logs afterward.

            • I saw 25 of the "did not send a valid status code after reading. Will close connection" warnings around task initialization (the readers are map tasks) on each data node. They all look local, so they are likely accessing the job conf/jar files, which, unlike regular data files, are replicated and available on all eight data nodes. Or they access the local DN for some other reason during this period. Need to check whether this needs to be fixed.
            • While running, there were 3 ESTABLISHED connections per process and some number of sockets in TIME_WAIT at all times. It means the socket cache is not leaking anything, clients are not denied new connections, and eviction is working (see the sketch after this list).
            • The only thing I find a bit odd is the symptom I mentioned above: duplicate connections in the socket cache. I will try to reproduce it with Todd's latest fix.
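          For reference, the eviction behavior being exercised is roughly the following: a capacity-bounded multimap from DataNode address to idle sockets, closing the oldest entry when full. This is an illustrative sketch using guava's LinkedListMultimap; the real SocketCache differs in detail.

          import java.io.IOException;
          import java.net.Socket;
          import java.net.SocketAddress;
          import java.util.Iterator;
          import java.util.List;
          import java.util.Map;
          import com.google.common.collect.LinkedListMultimap;

          class BoundedSocketCache {
            private final LinkedListMultimap<SocketAddress, Socket> idle =
                LinkedListMultimap.create();
            private final int capacity;  // e.g. 3 in the test above

            BoundedSocketCache(int capacity) { this.capacity = capacity; }

            synchronized Socket get(SocketAddress dn) {
              List<Socket> socks = idle.get(dn);
              return socks.isEmpty() ? null : socks.remove(0);
            }

            synchronized void put(SocketAddress dn, Socket sock) {
              if (idle.size() >= capacity) {
                // Evict the oldest idle socket; closing it is what produces
                // the steady trickle of TIME_WAIT sockets seen in netstat.
                Iterator<Map.Entry<SocketAddress, Socket>> it =
                    idle.entries().iterator();
                try { it.next().getValue().close(); } catch (IOException e) { }
                it.remove();
              }
              idle.put(dn, sock);
            }
          }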
          Kihwal Lee added a comment -

          Regarding duplicate connections, it makes sense because the input stream cache is per file, and it is quite possible that the clients read blocks belonging to two files that are on the same DN within the window of 3 reads.

          I will look at the ones happening during task initialization. Maybe they just stop reading in the middle of the stream by design. Since one message will show up for every new map task, how about changing the message to DEBUG after we are done with testing?

          Kihwal Lee added a comment -

          I am retesting with Todd's patch and I don't see the messages anymore. Instead, I see more of "BlockSender.sendChunks() exception: java.io.IOException: Broken pipe" from DNs.

          stack added a comment -

          @Kihwal I see lots of those sendChunks exceptions but I don't think they're related. Testing the latest addition to the patch...

          Kihwal Lee added a comment -

          OK, I see it's from BlockSender.java:407. It really shouldn't say ERROR since clients can close connections at any time, but I agree that this needs to be addressed as separate work.

          stack added a comment -

          +1 on commit for latest version of patch.

          I've been running it over the last few hours. I no longer see "Client /10.4.9.34did not send a valid status code after reading" (fix the space on commit), nor do I see the "Got error for OP_READ_BLOCK" exceptions. I still have the BlockSender.sendChunks exceptions, but they are something else (that we need to fix).

          Nice test you have over there Kihwal!

          My test was a 5-node cluster running hbase on a 451 patched 0.22. The loading was random reads running in MR, plus another random-read test being done via a bunch of clients. The cache was disabled, so all reads went to the FS. I also had random writing going on concurrently.

          Todd Lipcon added a comment -

          Regarding duplicate connections: also keep in mind that the caching only applies on the read side. So, assuming there's some output as well, there will be a socket for each of those output streams.

          I agree we should fix the "sendChunks" error messages separately. I think JD might have filed a JIRA about this a few weeks ago. I'll see if I can dig it up.

          Kihwal: are you +1 on commit now as well?

          Kihwal Lee added a comment -

          They were pure readers and didn't write/report anything until the end. I just filed HDFS-2054 for the error message. If you find the other JIRA that was already filed, please dupe one to the other.

          +1 for commit.

          Todd Lipcon added a comment -

          Committed to trunk.

          I'm 50/50 on whether this should go into the 0.22 branch as well. Like Stack said, it's a nice carrot to help convince HBase users to try out 0.22. But, it's purely an optimization and on the riskier side as far as these things go. I guess I'll ping Nigel?

          Todd Lipcon added a comment -

          Also, big thanks to: bc for authoring the majority of the patch and test cases, Sam Rash for reviews, and Stack and Kihwal for both code review and cluster testing. Great team effort spanning 4 companies!

          stack added a comment -

          Here is my backport of Todd's final patch. The main differences are adding in guava and the removal of TestDataXceiver (the util works differently in TRUNK).

          stack added a comment -

          Todd, I'll buy you a beer to go 51/49 in favor of a 0.22 commit. If Nigel wants me to make a case, I could do it here or in another issue?

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12481962/941.22.txt
          against trunk revision 1134031.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 21 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/754//console

          This message is automatically generated.

          stack added a comment -

          Forgot --no-prefix.

          Eli Collins added a comment -

          Make that two beers (52/48?). I reviewed an earlier version of this patch but if Nigel is game I think it's suitable for 22 as well.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12481963/941.22.txt
          against trunk revision 1134031.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 21 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/755//console

          This message is automatically generated.

          stack added a comment -

          So, that would leave 48 beers that I need to buy (and Nigel probably wants two) – can I get a keg?

          stack added a comment -

          Yeah, my 0.22 version fails against trunk (trunk already has guava, etc.)

          Nigel Daley added a comment -

          +1 for 0.22.

          Todd Lipcon added a comment -

          Cool, I will review and check in Stack's backport tomorrow.

          Kihwal Lee added a comment -

          One thing I noticed is that Socket.isConnected() cannot be used for checking the connection status in this case. It returns false until the connection is made and stays true after that; it will never return false once the initial connection has been successfully made. Socket.isClosed() or SocketChannel.isOpen() should be used instead, assuming someone handles SocketException and calls Socket.close() or SocketChannel.close(). It seems the op handlers in DataXceiver diligently use IOUtils.closeStream(), which will invoke SocketChannel.close().

          - } while (s.isConnected() && socketKeepaliveTimeout > 0);
          + } while (s.isConnected() && !s.isClosed() && socketKeepaliveTimeout > 0);
          

          Sorry for spotting this late. I just realized it while looking at HDFS-2054.
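          A self-contained demonstration of these Socket semantics (pure java.net, nothing HDFS-specific):

          import java.net.ServerSocket;
          import java.net.Socket;

          public class SocketStateDemo {
            public static void main(String[] args) throws Exception {
              try (ServerSocket server = new ServerSocket(0);
                   Socket s = new Socket("localhost", server.getLocalPort());
                   Socket peer = server.accept()) {
                System.out.println(s.isConnected());  // true
                System.out.println(s.isClosed());     // false
                s.close();
                // isConnected() answers "was a connection ever made?", not
                // "is the connection live now?": it stays true after close().
                System.out.println(s.isConnected());  // still true
                System.out.println(s.isClosed());     // true
              }
            }
          }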

          Todd Lipcon added a comment -

          Hey Kihwal. Nice find. Mind filing a new JIRA for this? I think it should be a minor thing, since the next time around the loop it will just hit the IOE trying to read the next operation anyway, right?

          Kihwal Lee added a comment -

          HDFS-2071 was filed.

          Todd Lipcon added a comment -

          Hey Stack. I just looked over your patch for 0.22. The only thing I noticed is that it no longer calls "verifiedByClient()" – this is a change that happened in trunk with HDFS-1655. Are we OK with removing this from 0.22?

          stack added a comment -

          I should put it back. Give me a sec...

          stack added a comment -

          Here's a new version that will call verifiedByClient IFF the return from the client is CHECKSUM_OK (comparing old and new, isBlockReadFully would appear to have a weaker semantic than didSendEntireByteRange). I added the else if in the below:

          -        if (DataTransferProtocol.Status.read(in) == CHECKSUM_OK) {
          -          if (blockSender.isBlockReadFully() && datanode.blockScanner != null) {
          -            datanode.blockScanner.verifiedByClient(block);
          +      if (blockSender.didSendEntireByteRange()) {
          +        // If we sent the entire range, then we should expect the client
          +        // to respond with a Status enum.
          +        try {
          +          DataTransferProtocol.Status stat = DataTransferProtocol.Status.read(in);
          +          if (stat == null) {
          +            LOG.warn("Client " + s.getInetAddress() + "did not send a valid status " +
          +                     "code after reading. Will close connection.");
          +            IOUtils.closeStream(out);
          +          } else if (stat == CHECKSUM_OK) {
          +            if (datanode.blockScanner != null) {
          +              datanode.blockScanner.verifiedByClient(block);
          +            }
                     }
          
          Todd Lipcon added a comment -

          Hey Stack. I still don't think this is quite right – it will now call verifiedByClient() if the client read the entire byterange, even if the byterange didn't cover the whole block. I think we need if (datanode.blockScanner != null && blockSender.isBlockReadFully()). Also, can you add back TestDataXceiver? I think that test case would catch this bug.

          stack added a comment -

          Thanks for the review, Todd. There is no isBlockReadFully method anymore; this patch removes it. Do you think I should add it back? TestDataXceiver was removed because the BlockReaderTestUtil#writeFile signature changed to return a List of Blocks instead of a byte[]. I can hack it around to work.

          Todd Lipcon added a comment -

          Yea, I think we should add back the blockReadFully variable (in addition to keeping the new sentEntireByteRange variable and its getter).

          Looks like there's a new getFileBlocks() method which can be used after writeFile() to get the block location, and then keep that test around?

          stack added a comment -

          I put back TestDataXceiver. It does this:

          -    List<LocatedBlock> blkList = util.writeFile(TEST_FILE, FILE_SIZE_K);
          +    // Create file.
          +    util.writeFile(TEST_FILE, FILE_SIZE_K);
          +    // Now get its blocks.
          +    List<LocatedBlock> blkList = util.getFileBlocks(TEST_FILE, FILE_SIZE_K);
          

          rather than change the writeFile signature (writeFile is used in a few other places so the change would ripple).

          I also added back BlockSender.isBlockReadFully, so the checks made before we call verifiedByClient are as they were before this patch:

          -        if (DataTransferProtocol.Status.read(in) == CHECKSUM_OK) {
          -          if (blockSender.isBlockReadFully() && datanode.blockScanner != null) {
          -            datanode.blockScanner.verifiedByClient(block);
          +      if (blockSender.didSendEntireByteRange()) {
          +        // If we sent the entire range, then we should expect the client
          +        // to respond with a Status enum.
          +        try {
          +          DataTransferProtocol.Status stat = DataTransferProtocol.Status.read(in);
          +          if (stat == null) {
          +            LOG.warn("Client " + s.getInetAddress() + "did not send a valid status " +
          +                     "code after reading. Will close connection.");
          +            IOUtils.closeStream(out);
          +          } else if (stat == CHECKSUM_OK) {
          +            if (blockSender.isBlockReadFully() && datanode.blockScanner != null) {
          +              datanode.blockScanner.verifiedByClient(block);
          +            }
                     }
          

          I ran the bundled tests and they pass. Am currently running all.

          Todd Lipcon added a comment -

          Looks good to me. How'd the test run go?

          stack added a comment -

          Bit odd. These failed when I ran all tests:

          
              [junit] Running org.apache.hadoop.hdfs.TestFileAppend4
              [junit] Tests run: 2, Failures: 0, Errors: 2, Time elapsed: 60.251 sec
          
              [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 104.115 sec
              [junit] Test org.apache.hadoop.hdfs.TestLargeBlock FAILED
          
          
              [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 60.022 sec
    [junit] Test org.apache.hadoop.hdfs.TestWriteConfigurationToDFS FAILED
          
          

          I reran them all, and only TestLargeBlock fails when the tests are run singly. If I back out the patch, TestLargeBlock still fails against a clean 0.22 checkout.

          Commit I'd say?

          stack added a comment -

          Or, hang on...(240 minutes) and let me rerun these tests and see if TestFileAppend4 and/or TestWriteConfigurationToDFS fail again.

          Konstantin Shvachko added a comment -

          Could anybody please run DFSIO to make sure there is no degradation in sequential IO.

          Todd Lipcon added a comment -

          Cos: do you have any reason to believe there would be? I believe in benchmarking, but unless there's some reasoning behind the idea, it can take a lot of time that's better spent in other places (e.g. optimizing sequential IO).

          If I recall correctly, early versions of this patch were indeed benchmarked for sequential IO, where we saw no difference.

          Todd Lipcon added a comment -

          oops, sorry Konstantin - didn't mean to call you Cos. But my comment stands

          Konstantin Shvachko added a comment -

          Yes, in the previous comment there was some degradation in throughput for sequential IO. I just want to make sure there is no degradation in the primary use case with this patch.

          stack added a comment -

          I reran tests, same three failed. I backed out my patch and the same three failed. So, this patch does not seem to be responsible for these test failures on my machine.

          I'm +1 on commit.

          Todd Lipcon added a comment -

          Committed the 0.22 patch. Konstantin: I think if you look at that comment again, you'll see that some of the test runs got faster, some got slower, and all were well within the standard deviation.

          Given that the only extra overhead that might be introduced here is a single lookup in the SocketCache, I see no reason to think this would have any negative effect. If you have benchmark results that disagree, please post them.

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #746 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/746/)

          Konstantin Shvachko added a comment -

          -1 on committing this without proof of no degradation to sequential IO.
          I should have done this before, but I thought my message was clear.
          Let me know if you want me to uncommit before benchmarks are provided.

          stack added a comment -

          @Konstantin

          The convention is that the RM says what's in a release, and no one else. See his +1 above.

          ...proof of no degradation to sequential IO.

          What would this test look like? The perf tests done above showed only minor differences ("...well within the standard deviation," as per Todd).

          And if this patch can only be committed pending perf evaluation, why single it out and not require that of all commits to HDFS?

          Kihwal Lee added a comment -

          Perhaps it's confusing because this JIRA is seen as random vs. sequential read. But in fact this JIRA is really about improving short reads, and the solution is to reduce the overhead of connection setup, which is present in both short and long reads. It is by no means favoring random or short reads. In fact, if a client does typical sequential reads multiple times from the same DN, this patch will help there too; the gain will be bigger if the files are smaller.

          Sure, there is a one-time overhead of a cache lookup (size: 16), but this can be ignored when the read size is sufficiently big. This cache-management overhead should show up, in theory, for very small cold (connection-wise) accesses. So far I have only seen gains. There might be some pathological cases where this patch actually makes reads slower, but again I don't believe they are typical use cases.

          Having said that, I think it is reasonable to run tests against the latest patch and make sure there is no regression in performance. Uncommitting now may do more harm than good. Let's see the numbers first and decide what to do.
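          Concretely, the client-side cost being weighed is just a cache lookup in front of the usual connect, something like this (illustrative names, not the actual DFSClient code):

          private Socket socketToDatanode(InetSocketAddress dn) throws IOException {
            Socket sock = socketCache.get(dn);  // cheap lookup, 16-entry cache
            if (sock == null) {
              // Cache miss: pay the TCP handshake. For short reads this
              // setup can dominate the time spent actually moving bytes.
              sock = new Socket();
              sock.connect(dn, connectTimeoutMs);
            }
            return sock;  // goes back to the cache later iff the read ends
                          // cleanly at an op boundary
          }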

          Todd Lipcon added a comment -

          Ran the following benchmark to compare 0.22 before vs after the application of HDFS-941:

          • inserted a 128M file into HDFS
          • read it 50 times using "hadoop fs -cat /file > /dev/null" and the unix "time" utility
          • recompiled with the patch reverted, restarted NN/DN
          • ran same test
          • recompiled with the patch included, restarted NN/DN
          • ran same test
          • recompiled with patch reverted
          • ran same test

          This resulted in 100 samples for each setup, 50 from each run. The following is the output of a t-test for the important variables:

          > t.test(d.22$wall, d.22.with.941$wall)

          Welch Two Sample t-test

          data: d.22$wall and d.22.with.941$wall
          t = -0.4932, df = 174.594, p-value = 0.6225
          alternative hypothesis: true difference in means is not equal to 0
          95 percent confidence interval:
          -0.011002972 0.006602972
          sample estimates:
          mean of x mean of y
          1.1937 1.1959

          > t.test(d.22$user, d.22.with.941$user)

          Welch Two Sample t-test

          data: d.22$user and d.22.with.941$user
          t = -1.5212, df = 197.463, p-value = 0.1298
          alternative hypothesis: true difference in means is not equal to 0
          95 percent confidence interval:
          -0.032378364 0.004178364
          sample estimates:
          mean of x mean of y
          1.3335 1.3476

          that is to say, it failed to reject the null hypothesis... in less stat-heavy terms, there's no statistical evidence that this patch makes the test any slower.
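          For anyone who wants to redo this from raw timings without R, Welch's t and the Welch-Satterthwaite degrees of freedom are easy to compute directly. The following is self-contained; the sample arrays are made-up placeholders, not the actual measurements:

          public class WelchT {
            static double mean(double[] x) {
              double s = 0;
              for (double v : x) s += v;
              return s / x.length;
            }
            static double var(double[] x) {  // unbiased sample variance (n-1)
              double m = mean(x), s = 0;
              for (double v : x) s += (v - m) * (v - m);
              return s / (x.length - 1);
            }
            public static void main(String[] args) {
              double[] a = {1.19, 1.21, 1.18, 1.20, 1.19};  // placeholder wall times
              double[] b = {1.20, 1.19, 1.21, 1.18, 1.20};
              double va = var(a) / a.length;  // s1^2 / n1
              double vb = var(b) / b.length;  // s2^2 / n2
              double t = (mean(a) - mean(b)) / Math.sqrt(va + vb);
              double df = (va + vb) * (va + vb)  // Welch-Satterthwaite
                  / (va * va / (a.length - 1) + vb * vb / (b.length - 1));
              System.out.printf("t = %.4f, df = %.3f%n", t, df);
            }
          }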

          Konstantin Shvachko added a comment -

          > I think it is reasonable to run tests against the latest patch and make sure there is no regression in performance.

          This is exactly what I asked: run DFSIO on a 5-node cluster with and without the new patch.
          Here is the command I propose to run for 5 nodes (should have nrFiles <= nrNodes).

          TestDFSIO -read -fileSize 10GB -nrFiles 5
          

          You can run -write first to generate data.
          I think this will be representative enough.

          > it failed to reject the null hypothesis

          Great analysis Todd, I am truly impressed. Does everything run on one node? Is there any inter-DN communication then? Also, with a 128 MB file everything is in RAM; I'm not sure what it measures.

          > Uncommitting now may do more harm than good.

          If my concerns can be addressed without uncommitting, then I can hold off on that. Please confirm somebody is doing it.

          Todd Lipcon added a comment -

          Konstantin:
          Yes, everything runs on one node. It tests the localhost path, which is identical to what would be tested with your proposed benchmark (nrFiles <= nrNodes means full locality, right?).

          Also, with a 128 MB file everything is in RAM; I'm not sure what it measures

          It measures the overhead of DFS rather than the cost of IO. Having it not be in RAM makes for a worse test since differences in CPU overhead are lost in the noise of the slow disks.

          Kihwal Lee added a comment -

          >Also, with a 128 MB file everything is in RAM; I'm not sure what it measures.

          If cold reads are performed, disk I/O will be the bottleneck, and that can bury under noise whatever overhead the patch might have introduced in connection handling. Since the patch didn't change the rest of the serving code, the ideal way of measuring its overhead would be having the DN do something like null ops, effectively putting a magnifying glass on the area where the change was made. In a normal setup, the next best thing is probably what Todd did.

          In any case, I will run DFSIO as you suggested.

          Todd Lipcon added a comment -

          Another thing to note is that TestDFSIO itself is a pretty flawed test. It exhibits very high variance, and its results are very much dependent on mapreduce's scheduling. For example, dropping the MR heartbeat interval from 3 seconds to 0.3 seconds improved "DFS IO performance" by nearly 2x in some tests I ran a few months ago.

          dhruba borthakur added a comment -

          My experience with TestDFSIO has been that the variance of its results is high (especially due to map-reduce scheduling), and it could never capture (at least, for me) small differences in DFS performance.

          Kihwal Lee added a comment -

          The following are four consecutive samples taken from the middle of a larger set. This is the test Konstantin suggested. The std dev of the I/O rate seems too high, and so does the variation in run times. This is probably not the best way to measure small performance differences, as others have pointed out.

           
          
          ----- TestDFSIO ----- : read
                     Date & time: Thu Jun 16 20:29:41 UTC 2011
                 Number of files: 5
          Total MBytes processed: 51200.0
               Throughput mb/sec: 100.75824515346937
          Average IO rate mb/sec: 136.13864135742188
           IO rate std deviation: 92.17360497645333
              Test exec time sec: 179.953
          
          ----- TestDFSIO ----- : read
                     Date & time: Thu Jun 16 20:31:23 UTC 2011
                 Number of files: 5
          Total MBytes processed: 51200.0
               Throughput mb/sec: 150.92337396277026
          Average IO rate mb/sec: 197.9733428955078
           IO rate std deviation: 106.59864139156599
              Test exec time sec: 99.805
          
          ----- TestDFSIO ----- : read
                     Date & time: Thu Jun 16 20:33:20 UTC 2011
                 Number of files: 5
          Total MBytes processed: 51200.0
               Throughput mb/sec: 115.66831207852795
          Average IO rate mb/sec: 145.11795043945312
           IO rate std deviation: 90.42587602009961
              Test exec time sec: 115.77
          
          ----- TestDFSIO ----- : read
                     Date & time: Thu Jun 16 20:36:31 UTC 2011
                 Number of files: 5
          Total MBytes processed: 51200.0
               Throughput mb/sec: 91.04763462868748
          Average IO rate mb/sec: 127.12406921386719
           IO rate std deviation: 97.86844611649816
              Test exec time sec: 189.954
          

          I ran shorter (64KB) reads so that the variance is smaller and the proportion of overhead is larger. For larger reads, the overhead becomes less noticeable.

          === BEFORE ===
          
          ----- TestDFSIO ----- : read
                     Date & time: Thu Jun 16 23:00:03 UTC 2011
                 Number of files: 5
          Total MBytes processed: 4.7683716
               Throughput mb/sec: 24.328426438934947
          Average IO rate mb/sec: 24.558759689331055
           IO rate std deviation: 2.474296728169802
              Test exec time sec: 8.444
          
          ----- TestDFSIO ----- : read
                     Date & time: Thu Jun 16 23:00:13 UTC 2011
                 Number of files: 5
          Total MBytes processed: 4.7683716
               Throughput mb/sec: 23.374370500153187
          Average IO rate mb/sec: 23.41034698486328
           IO rate std deviation: 0.9176091691810716
              Test exec time sec: 8.41
          
          ----- TestDFSIO ----- : read
                     Date & time: Thu Jun 16 23:00:23 UTC 2011
                 Number of files: 5
          Total MBytes processed: 4.7683716
               Throughput mb/sec: 24.83526865641276
          Average IO rate mb/sec: 24.873613357543945
           IO rate std deviation: 0.9842580011607321
              Test exec time sec: 8.424
          
          ----- TestDFSIO ----- : read
                     Date & time: Thu Jun 16 23:00:33 UTC 2011
                 Number of files: 5
          Total MBytes processed: 4.7683716
               Throughput mb/sec: 24.57923495892397
          Average IO rate mb/sec: 24.62860679626465
           IO rate std deviation: 1.1144092332035256
              Test exec time sec: 8.41
          
          
          === AFTER ===
          
          ----- TestDFSIO ----- : read
                     Date & time: Thu Jun 16 23:07:34 UTC 2011
                 Number of files: 5
          Total MBytes processed: 4.7683716
               Throughput mb/sec: 23.961666241363066
          Average IO rate mb/sec: 23.970088958740234
           IO rate std deviation: 0.4478642432612885
              Test exec time sec: 8.378
          
          ----- TestDFSIO ----- : read
                     Date & time: Thu Jun 16 23:07:44 UTC 2011
                 Number of files: 5
          Total MBytes processed: 4.7683716
               Throughput mb/sec: 24.57923495892397
          Average IO rate mb/sec: 24.58832550048828
           IO rate std deviation: 0.4712211529700926
              Test exec time sec: 8.394
          
          ----- TestDFSIO ----- : read
                     Date & time: Thu Jun 16 23:07:53 UTC 2011
                 Number of files: 5
          Total MBytes processed: 4.7683716
               Throughput mb/sec: 22.92486337515024
          Average IO rate mb/sec: 22.95939064025879
           IO rate std deviation: 0.8841870285378609
              Test exec time sec: 8.388
          
          ----- TestDFSIO ----- : read
                     Date & time: Thu Jun 16 23:08:03 UTC 2011
                 Number of files: 5
          Total MBytes processed: 4.7683716
               Throughput mb/sec: 24.204931888483504
          Average IO rate mb/sec: 24.234447479248047
           IO rate std deviation: 0.8576845331358649
              Test exec time sec: 8.382
          
          

          I didn't try to do any statistical analysis on it. If somebody wishes to, I can provide a larger set of data.

Kihwal Lee added a comment -

Following are 4 consecutive samples taken from the middle of a larger set. This is the test Konstantin suggested. The std dev of the I/O rate seems too high, and so does the variation in run times. This is probably not the best way to measure small performance differences, as others have pointed out.

----- TestDFSIO ----- : read (Number of files: 5, Total MBytes processed: 51200.0)

  Date & time                    Throughput mb/sec    Average IO rate mb/sec   IO rate std deviation   Test exec time sec
  Thu Jun 16 20:29:41 UTC 2011   100.75824515346937   136.13864135742188       92.17360497645333       179.953
  Thu Jun 16 20:31:23 UTC 2011   150.92337396277026   197.9733428955078        106.59864139156599      99.805
  Thu Jun 16 20:33:20 UTC 2011   115.66831207852795   145.11795043945312       90.42587602009961       115.77
  Thu Jun 16 20:36:31 UTC 2011   91.04763462868748    127.12406921386719       97.86844611649816       189.954

I ran shorter (64KB) reads so that the variances are smaller and the proportion of overhead is larger. For larger reads, the overhead will become less noticeable.

=== BEFORE === ----- TestDFSIO ----- : read (Number of files: 5, Total MBytes processed: 4.7683716)

  Date & time                    Throughput mb/sec    Average IO rate mb/sec   IO rate std deviation   Test exec time sec
  Thu Jun 16 23:00:03 UTC 2011   24.328426438934947   24.558759689331055       2.474296728169802       8.444
  Thu Jun 16 23:00:13 UTC 2011   23.374370500153187   23.41034698486328        0.9176091691810716      8.41
  Thu Jun 16 23:00:23 UTC 2011   24.83526865641276    24.873613357543945       0.9842580011607321      8.424
  Thu Jun 16 23:00:33 UTC 2011   24.57923495892397    24.62860679626465        1.1144092332035256      8.41

=== AFTER === ----- TestDFSIO ----- : read (Number of files: 5, Total MBytes processed: 4.7683716)

  Date & time                    Throughput mb/sec    Average IO rate mb/sec   IO rate std deviation   Test exec time sec
  Thu Jun 16 23:07:34 UTC 2011   23.961666241363066   23.970088958740234       0.4478642432612885      8.378
  Thu Jun 16 23:07:44 UTC 2011   24.57923495892397    24.58832550048828        0.4712211529700926      8.394
  Thu Jun 16 23:07:53 UTC 2011   22.92486337515024    22.95939064025879        0.8841870285378609      8.388
  Thu Jun 16 23:08:03 UTC 2011   24.204931888483504   24.234447479248047       0.8576845331358649      8.382

I didn't try to do any statistical analysis on it. If somebody wishes to, I can provide a larger set of data.
          Konstantin Shvachko added a comment -

          Kihwal, thanks for doing this.
For the first set of results, is it with or without the patch? Should there be "BEFORE" and "AFTER" sections, as in the second set?

          Kihwal Lee added a comment -

I am not 100% sure right now, but I think they are "before" numbers. Either way, with or without the patch, the numbers were all over the place, with large std devs.

          Tsz Wo Nicholas Sze added a comment -

          > My experience with TestDFSIO has been that the variance of its results are higher ...

I have had a similar experience.

Also, it is hard to explain what the "Throughput" and "Average IO rate" really mean. In Kihwal's results, the value could be >100 MB/sec, which seems unreasonable.

          Konstantin Shvachko added a comment -

A 150 MB/sec throughput is possible if your data.dir is on a filer, e.g. if it is your home directory or /tmp. That would also explain the ridiculous standard deviation, because the test competed with Nicholas running ant test in his home dir, which is on the same filer. Set data.dir to crawlspace3 and you will start getting reasonable numbers.
          What is the cluster size?

          Kihwal Lee added a comment -

          Filer was not used. Cluster has 5 DNs with a separate NN.

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #699 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/699/)

          Tsz Wo Nicholas Sze added a comment -

Hi Konstantin, don't you agree that the "Throughput" and "Average IO rate" numbers do not make much sense? At least the definitions of these two numbers are not clear. Recall that we got >1 GB/sec in the past.

          10/08/07 00:19:55 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
          10/08/07 00:19:55 INFO fs.TestDFSIO:            Date & time: Sat Aug 07 00:19:55 UTC 2010
          10/08/07 00:19:55 INFO fs.TestDFSIO:        Number of files: 2
          10/08/07 00:19:55 INFO fs.TestDFSIO: Total MBytes processed: 2048
          10/08/07 00:19:55 INFO fs.TestDFSIO:      Throughput mb/sec: 1096.3597430406853
          10/08/07 00:19:55 INFO fs.TestDFSIO: Average IO rate mb/sec: 1143.6881103515625
          10/08/07 00:19:55 INFO fs.TestDFSIO:  IO rate std deviation: 232.655606509863
          10/08/07 00:19:55 INFO fs.TestDFSIO:     Test exec time sec: 28.354
          
          Konstantin Shvachko added a comment -

I ran some tests myself over the weekend. The results are good. I am getting throughput around 75-78 MB/sec on reads, with a small (< 2) std. deviation in both cases.
          So I am +1 now on this patch.

          Konstantin Shvachko added a comment -

          Answers to some issues raised here:

Stack> RM says what's in a release and no one else.

We can still talk about the technical merits of the implementation, can't we?

          Todd> nrFiles <= nrNodes means full locality, right?

No. In DFSIO there is no locality, since the files that DFSIO reads/writes are not the input of the MR job; only their names are. The reason for nrFiles <= nrNodes is to make sure the job completes in one wave of mappers, and to minimize contention on the drives between tasks.

I was trying to avoid making this issue yet another discussion about DFSIO, because the objective here is to verify that the patch does not introduce a performance regression for sequential IO. If the benchmark I proposed doesn't work for you guys, you can propose a different one.

          Dhruba, Todd, Nicholas> TestDFSIO exhibits very high variance, and its results are dependent on mapreduce's scheduling.

DFSIO does not depend on the MR scheduling. It depends on the OS memory cache.
Cluster nodes these days run with 16 or 32 GB of RAM, so a 10GB file can be almost entirely cached by the OS. When you repeatedly run DFSIO, you are not measuring cold IO, but RAM access and communication. The high variation is explained by the fact that some data is cached and some is not.
For example, DFSIO -write is usually very stable, with std.dev < 1. This is because it deals with cold writes.
For DFSIO -read you need to choose a file size larger than your RAM. With sequential reads the OS cache works as an LRU, so if your file is larger than RAM, the cache will "forget" blocks from the head of the file by the time you get to reading the tail. And when you start reading the file again, the cache will release its oldest pages, which correspond to the higher offsets in the file. So it is going to be a cold read.
I had to go to 100GB files, which brought std.dev to < 2, and the variation in throughput was around 3%.
Alternatively, you can clean the Linux cache on all DataNodes, as sketched below.
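
A minimal sketch of that cache-cleaning step, assuming Linux DataNodes and root access (the helper class name is made up; /proc/sys/vm/drop_caches is the standard Linux interface, and the sync is needed because the kernel only drops clean pages):

    import java.io.IOException;

    // Hypothetical helper, run on every DataNode between benchmark runs
    // (requires root): flush dirty pages, then ask the kernel to drop the
    // page cache (plus dentries and inodes) so the next run reads cold data.
    public class DropOsCache {
      public static void main(String[] args) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
            "/bin/sh", "-c", "sync && echo 3 > /proc/sys/vm/drop_caches");
        pb.inheritIO(); // surface any permission errors on the console
        int rc = pb.start().waitFor();
        System.out.println("drop_caches exited with " + rc);
      }
    }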

Nicholas> it is hard to explain what the "Throughput" and "Average IO rate" really mean.

          This post has the definitions.
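
Roughly speaking (a sketch of the common understanding, with made-up numbers, not the actual TestDFSIO source): "Throughput" divides the total data moved by the sum of all task times, while "Average IO rate" is the mean of the per-task rates, so a few very fast (cached) tasks can push the latter far above raw disk speed:

    // Hypothetical aggregation sketch; the per-task sizes and times are invented.
    public class DfsioNumbersSketch {
      public static void main(String[] args) {
        double[] mb  = {1024, 1024, 1024, 1024, 1024}; // MB read per task
        double[] sec = {40.0, 2.0, 41.0, 2.5, 42.0};   // seconds per task
                                                       // (two tasks hit the OS cache)
        double totalMb = 0, totalSec = 0, rateSum = 0, rateSqSum = 0;
        for (int i = 0; i < mb.length; i++) {
          totalMb += mb[i];
          totalSec += sec[i];
          double rate = mb[i] / sec[i];
          rateSum += rate;
          rateSqSum += rate * rate;
        }
        int n = mb.length;
        double throughput = totalMb / totalSec; // "Throughput mb/sec"
        double avgIoRate  = rateSum / n;        // "Average IO rate mb/sec"
        double stdDev = Math.sqrt(Math.abs(rateSqSum / n - avgIoRate * avgIoRate));
        System.out.println("Throughput mb/sec: " + throughput);
        System.out.println("Average IO rate mb/sec: " + avgIoRate);
        System.out.println("IO rate std deviation: " + stdDev);
      }
    }

With the two cached tasks above, the sketch prints a Throughput of about 40 mb/sec but an Average IO rate near 200 mb/sec, which is exactly the kind of >100 MB/sec reading discussed here.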

Nicholas, I agree with you that the results you are posting don't make sense.
The point, though, is not to scrap the benchmark, but to find the conditions under which it reliably measures what you need.

          Arun C Murthy added a comment -

> TestDFSIO depends on MR scheduling

          I've run into issues with this too - in the past changes to the JT/JIP/scheduler would cause DFS I/O performance to 'regress'!

We need to re-work TestDFSIO. One way would be to do 'scheduling' in the input format of the test, similar to what we did with TeraSort (a rough sketch of the idea follows below). Even better, stop using an MR job.
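
For illustration, a hypothetical sketch (class and host names are made up, and this is not from any patch attached here) of what "scheduling in the input format" could look like with the org.apache.hadoop.mapreduce API; note that split locations are only placement hints, so this narrows, rather than eliminates, the scheduler's influence:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.*;

    // One split per worker host, with that host as the split's only preferred
    // location, so task placement no longer depends on scheduler behavior.
    public class HostPinnedInputFormat extends InputFormat<Text, NullWritable> {

      public static class HostSplit extends InputSplit implements Writable {
        private String host = "";
        public HostSplit() {}                        // required for deserialization
        public HostSplit(String host) { this.host = host; }
        @Override public long getLength() { return 1; }
        @Override public String[] getLocations() { return new String[] { host }; }
        @Override public void write(DataOutput out) throws IOException { Text.writeString(out, host); }
        @Override public void readFields(DataInput in) throws IOException { host = Text.readString(in); }
      }

      @Override
      public List<InputSplit> getSplits(JobContext ctx) {
        // Hypothetical host list; a real version would ask the cluster for its live nodes.
        String[] hosts = { "dn1", "dn2", "dn3", "dn4", "dn5" };
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (String h : hosts) splits.add(new HostSplit(h));
        return splits;
      }

      @Override
      public RecordReader<Text, NullWritable> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
        return new RecordReader<Text, NullWritable>() {
          private Text key;
          private boolean done = false;
          @Override public void initialize(InputSplit s, TaskAttemptContext c) {
            key = new Text(((HostSplit) s).host);   // each task reads one record: its host name
          }
          @Override public boolean nextKeyValue() {
            if (done) return false;
            done = true;
            return true;
          }
          @Override public Text getCurrentKey() { return key; }
          @Override public NullWritable getCurrentValue() { return NullWritable.get(); }
          @Override public float getProgress() { return done ? 1.0f : 0.0f; }
          @Override public void close() {}
        };
      }
    }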

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-22-branch #70 (See https://builds.apache.org/job/Hadoop-Hdfs-22-branch/70/)


  People

    • Assignee: bc Wong
    • Reporter: Todd Lipcon
    • Votes: 0
    • Watchers: 38
